Hacker News | ianbutler's comments

Hey guys, we commented on another thread from a few days ago about our tool Bismuth finding the bug (along with a SHA of our reproducer script for proof): https://news.ycombinator.com/item?id=43489944

The reproducer is linked in the post.

After disclosing and corresponding with Gerlof, and given the post on the front page right now (https://news.ycombinator.com/item?id=43518560), it looks like we did in fact nail it. I've just shared our write-up above on how we got it.


Thank you. I don't use atop, but I should probably check all my servers. I'm going to check my custom routers; they may have it.

Hey guys, we commented on another thread from a few days ago about our tool Bismuth finding the bug (along with a SHA of our reproducer script for proof): https://news.ycombinator.com/item?id=43489944

After disclosing and corresponding with Gerlof, and from his post above, it looks like we did in fact nail it. I've just shared our write-up on how we got it.

HN post detailing how we got it: https://news.ycombinator.com/item?id=43519522

Edit: Here's our reproducer and we've added it to the post too: https://gist.github.com/kallsyms/3acdf857ccc5c9fbaae7ed823be...


What is that a hash of?

As noted, it's a hash of our reproducer script.

Right, but where’s the script?


Cool, thanks for adding it. It would also be nice if you posted how you generated the hash :) I’m not trying to be annoying but this is a critical part of how these hashes work; you post the hash early to indicate you have some information early and then later you demonstrate that by actually presenting the artifact with that hash. If you don’t publish the artifact so people can check that it is actually what you claim it is then your hash is worthless (as nobody can prove it’s not, like, the hash of a cat photo). And you’d generally want to demonstrate how you generated the hash just so people don’t have to figure out whether to md5 or sha1sum it.

Hey yeah got caught up in the excitement of finding it :)

It's a SHA256 - `shasum -a 256 server.py`
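For anyone who wants to check the reveal themselves, here's a minimal sketch of the commit/verify step in Python (the filename and digest are whatever was actually published; this just mirrors what `shasum -a 256` computes):

```python
import hashlib


def sha256_of(path: str) -> str:
    """SHA-256 hex digest of a file, equivalent to `shasum -a 256 <path>`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_commitment(path: str, committed_digest: str) -> bool:
    """Reveal phase of a commit/reveal scheme: re-derive the digest from the
    published artifact and compare it to the digest posted earlier."""
    return sha256_of(path) == committed_digest.lower()
```

The point of the scheme is exactly what the earlier comment describes: the digest published up front is worthless unless the artifact is later released so anyone can recompute it.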


Hey Ian from Bismuth here, we see a lot of people talking about how "dumb" LLMs are and one thing I think people are still lagging behind on is integrating advanced code analysis into the agent loop.

For inner loop tools like Cursor this means using better static analysis and maybe test running.

For tools like ours, which sit on the outer loop and run asynchronously, you can lean on more advanced techniques like fuzzing to find bugs and use the results in the agent loop to produce a higher-quality artifact.
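As a rough sketch of what such an outer loop could look like (this is illustrative, not Bismuth's actual implementation; `generate` and `run_fuzzer` are hypothetical stand-ins for an LLM call and a fuzzing harness):

```python
from typing import Callable, Optional


def fuzz_guided_loop(
    generate: Callable[[list[str]], str],       # LLM patch generator (stub)
    run_fuzzer: Callable[[str], Optional[str]], # returns a crash log, or None if clean
    max_rounds: int = 3,
) -> str:
    """Outer-loop sketch: generate a candidate, fuzz it, and feed any
    crash log back into the next generation round as context."""
    findings: list[str] = []
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate(findings)
        crash = run_fuzzer(candidate)
        if crash is None:
            break                    # fuzzer found nothing within its budget
        findings.append(crash)       # crash log becomes context for the retry
    return candidate
```

The crash logs play the same role compiler errors do in an inner-loop tool: a concrete, machine-checked signal the model can react to instead of guessing.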

It may add 5-10 minutes of extra time, but that isn't a problem when the task itself takes 20-40 minutes to finish.

So I think as a young industry we've barely scratched the surface on techniques to guide code generation and prevent unwanted outcomes for developers.

We've been using this very effectively as we've rolled out with early design partners and we also have some other content coming up very soon that cements the usefulness of this technique.


Hey guys, I work on a tool called Bismuth along with my co-founder for finding and fixing bugs, and we think we have this. At the very least we have a bug in atop which mimics what is being described.

We're going to throw this sha down right here: 1d53b12f3bc325dcfaff51a89011f01bffca951db9df363e6b5d6233f23248a5

And now we're going to go responsibly disclose what we have to the maintainers.



We've reached out to the maintainer over e-mail.

Based on the bug you've found, do you think it's exploitable beyond DoS?

Thank you for doing this instead of just vagueposting and wasting everyone’s time.

I'm inclined to expect that we should put the blame for that on whoever used legal channels to force Rachel to shut up, although obviously the jury is still out until we know more.

This is the only explanation that makes some sense, otherwise it would be just a dick move for someone to hint at the presence of an exploitable bug but then not say what exactly it is.

Not telling you how to hack a whole bunch of computers until the bug is fixed is called responsible disclosure. It's very popular, and depending on how the government is feeling that day, may be illegal not to do.

I'm reading "I can go into why another time." like "I don't have time" personally, not like "I am not allowed to say".

Then you are overlooking two things that provide important context: her previous behavior in similar circumstances of discovering bugs, and the opening sentence:

> My life as a mercenary sysadmin can be interesting.

To me this reads as "I was hired as a consultant for something that required a very restrictive NDA."


Hi! Three things:

- There is no commit with a SHA-1 like that in atop's Git history, and what you shared is too long for a SHA-1; it looks more like a SHA-256. Did you share the right checksum? The only other way I can read this is that it's a SHA-256 checksum of one of the past atop release tarballs or artifacts. I have not yet checked those.

- I have tried finding your tool Bismuth, but all I find are things related to KDE and cryptocurrencies. Please share a link to the Bismuth that you are working on.

- You technically said that you are working on Bismuth /and/ found something, not that you found the bug /through/ Bismuth. Please clarify if and how that was the case.

Thank you!


- That SHA is just a proof marker: if it turns out we're correct, we can prove we had it at that time.

- Bismuth did indeed find the bug, our bug scanning feature in particular. Obviously we're going to sit on our hands until the maintainer gives the all clear but we'll write something up after this is all squared away

- https://www.bismuth.sh is our tool, we're still relatively new


pretty sure it's just a hash of some text they can reveal later, to prove that they had something at this point in time. not referring to any release or commit

This is exactly correct

I see, thanks!


Hey, this is sweet. We have our own internal version of this, including graph search based on the call graph, which we use to great effect in our agent.

Glad to see people making something like this more generally available.


Thanks! If you ever wanna trade notes or if we can be of any help, feel free to reach out at ayman@nuanced.dev!


Ah optimistic locking. I implemented that on top of a radix tree to make a concurrent disk backed adaptive radix tree for text search in Rust.

I was looking at a blog post talking about the early days of Algolia’s search engine when I decided I wanted to try my own implementation of their algorithm.

https://github.com/iantbutler01/dart

dart = disk backed adaptive radix tree

The idea is that hot paths stay in memory and cold paths get shunted to disk; since text search tends to see a pretty regular set of queries, you get a really nice trade-off between speed and storage.
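The optimistic-locking read protocol can be sketched like this (a minimal illustration of even/odd version-counter validation, not dart's actual Rust code; node structure and names are hypothetical):

```python
import threading


class Node:
    """A node guarded by an even/odd version counter: an odd version means a
    writer is mid-update, an even version means the node is stable."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()  # writers still serialize among themselves

    def write(self, value):
        with self._lock:
            self.version += 1          # now odd: concurrent readers must retry
            self.value = value
            self.version += 1          # even again: node is stable

    def read(self):
        """Lock-free read: snapshot the version, read, then validate that the
        version didn't change. Retry on any interleaved write."""
        while True:
            v = self.version
            if v % 2:                  # writer in progress, spin and retry
                continue
            value = self.value
            if self.version == v:      # nothing changed while we were reading
                return value
```

Readers pay nothing in the common case, which is exactly what you want when the hot paths of the tree are read-dominated.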


Okay so this is a personal opinion right? Like where is the objectivity in your review?

What are the hardline performance characteristics being violated? Or the functional incorrectness? Is this just "it's against my sensibilities"? Because at the end of the day, frankly, no one agrees on how to develop anything.

The thing I see a lot of developers struggle with is just because it doesn't fit your mental model doesn't make it objectively bad.

So unless it's objectively wrong or worse in a measurable characteristic I don't know that it matters.

For the record I'm not asserting it is right, I'm just saying I've seen a lot of critiques of LLM code boil down to "it's not how I'd write it" and I wager that holds for every developer you'll ever interact with.


OP didn't put much effort into writing the code, so I'm certainly not putting much effort into a proper review of it, for no benefit to me no less. I just wanted to see what quality AI gets you, and made a comment about it.

I'm pretty sure the code not having the "if (…) lexer->line++" in places is just a plain simple repeated bug that'll result in wrong line numbers for certain inputs.

And human-wise I'd say the simple way to not have made that bug would've been to make/change abstractions upon the second or so time writing "if (…) lexer->line++", such that it takes effort to do it incorrectly, whereas the linked code allows getting it wrong by default with no indication that there's a thing to be gotten wrong. Point being that bad abstractions are not just a maintenance nightmare, but also make doing code review (which is extra important with LLM code) harder.
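The suggested abstraction can be sketched like so (in Python rather than the C of the linked code; names are illustrative). The idea is that line counting lives in exactly one place, so no caller can forget the newline check:

```python
from typing import Optional


class Lexer:
    """All character consumption goes through advance(), which owns the
    one-and-only line-increment site."""

    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1

    def advance(self) -> Optional[str]:
        """Consume and return the next character, or None at end of input."""
        if self.pos >= len(self.source):
            return None
        ch = self.source[self.pos]
        self.pos += 1
        if ch == "\n":      # the single place line tracking happens
            self.line += 1
        return ch
```

With this shape, getting the line number wrong requires deliberately bypassing `advance()`, which is the kind of "wrong by effort, right by default" property the comment is arguing for.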


I agree, it seems a lot of the complaints boil down to academic reasons.

Fine, it's not the best and may run into some longer-term issues, but most importantly it works at this point in time.

A snobby/academic equivalent would be someone using an obscure language such as COBOL.

The world continues to turn.


I use it every day; it has to have good search and good static analysis built in.

You also have to be very explanatory with a direct communication style.

Our system imports the codebase so it can search and navigate plus we feed lsp errors directly to the LLM as development is happening.
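A sketch of what feeding LSP diagnostics into the model's context might look like (field names follow the LSP `publishDiagnostics` payload; the selection and formatting policy here is hypothetical, not our actual pipeline):

```python
def diagnostics_to_prompt(diagnostics: list) -> str:
    """Flatten LSP diagnostic dicts into plain-text lines the model can read.
    LSP positions are zero-based, so convert to the 1-based numbers humans
    (and models trained on human text) expect."""
    lines = []
    for d in diagnostics:
        start = d["range"]["start"]
        lines.append(
            f"{d.get('source', 'lsp')}: line {start['line'] + 1}, "
            f"col {start['character'] + 1}: {d['message']}"
        )
    return "Current compiler/LSP diagnostics:\n" + "\n".join(lines)
```

Appending something like this to the prompt after every edit gives the model the same fast feedback a human gets from red squiggles in an editor.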


> we feed lsp errors directly to the LLM as development is happening

Yeah I guess that would help a lot. Stuck with a bit more primitive tools here, so that doesn't help.


> 3.5 Sonnet | Yes | IC SWE (Diamond) | N/A | 26.2% | $58k / $236k | 24.5%

But sonnet solved over 25% of them and made 60 grand.

That's a substantial amount of work. I don't entirely disagree with you about it being premature but these things are clearly providing substantial value.


>But sonnet solved over 25% of them and made 60 grand.

Technically it didn’t since all these tasks were done some time ago. On that note, I feel like putting a dollar amount on the tasks it was able to complete is misleading.

In the real world, if a model masquerading as a human is only right 25% of the time, its reviews on Upwork would reflect that and it would never be able to find work ever again. It might make a couple thousand before it loses trust.

Of course things would be different if they were open and upfront about this being an LLM, in which case it would presumably never run out of trust.

And again, Expensify is an anomaly among companies in that it gives freelancers well articulated tasks to work on. The real world is much more messy.


That's a lot of qualifying you have to do to discount this, which is fine, but my take is that you do so at your own peril as we look to the future of this tech.

The real world is messy but the real world also adapts to the most cost effective solution even if it's just alright.

People will spend more time specifying their task for an LLM based tool if it gets the job done and costs a fraction of a freelancer.


We've built a pretty complex CLI using Ratatui and so far we like it a lot. Excited about the web targets; I wasn't aware of this, and it might make our lives a lot easier for something we want/need to do.


OP here, that's great! Let me know if you ever try out Ratzilla :)


