Not sure I agree (and I made the jump from IC to management).
Look at the parallel tracks. A VP is roughly the same level as a distinguished engineer. To be a VP, you have to be a great manager and have gotten lucky with a few big projects.
To be a DE, you basically have to be famous within the industry. And when I look at a large tech company, while there aren't a lot of VPs, usually the number of DEs is countable on one hand (or maybe two).
They are very different skill sets. You shouldn't choose your role based on money or career progression; you should choose based on what you love to do. Especially in this world of AI replacing all the "boring" work, the only people left will be the ones passionate about what they're doing.
Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).
An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this, and you talk about how you "finish each other's sentences".
It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why, when they say something you didn't expect, you say "what?" and then answer half a second later, once your brain corrects.
Fact 2: Humans expect a delay from their voice assistants, for two reasons. First, they know it's a computer that has to think. Second, cell phones: they have a built-in delay that breaks human-to-human speech rhythm, and your brain treats a voice assistant like a cell phone.
Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".
Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
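To make the old heuristic concrete, silence-based end-of-turn is basically just counting quiet audio frames; a minimal sketch (illustrative numbers, not Alexa's actual implementation) might look like this, where semantic detection would instead ask a model whether the utterance sounds finished:

```python
# Silence-based end-of-turn: the turn is "over" once we've seen ~300ms of
# consecutive low-energy frames. All constants here are illustrative.

FRAME_MS = 10            # each audio frame covers 10ms
SILENCE_THRESHOLD = 0.01 # energy below this counts as silence
END_OF_TURN_MS = 300     # the 300ms heuristic described above

def is_end_of_turn(frame_energies: list[float]) -> bool:
    """Return True once the trailing run of silent frames reaches END_OF_TURN_MS."""
    needed = END_OF_TURN_MS // FRAME_MS
    silent_run = 0
    for energy in reversed(frame_energies):
        if energy >= SILENCE_THRESHOLD:
            break
        silent_run += 1
        if silent_run >= needed:
            return True
    return False
```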
This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.
Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.
Regarding 2, I believe that talking on mobile phones drives older people crazy. They remember talking on normal land lines when there was almost no latency at all. The thing is -- they don't know why they don't like it.
Yeah, I remember the time when we had to use satellites to connect. The long delay was really annoying and so unusual that most people without "training" could not even use the phone for conversation and just wasted the dollars.
A former boss of mine took off to Everest for a month leaving me (a 22 year old, at the time) in charge of the office. I was out to dinner with my now wife when I got a call from a very long phone number I didn't recognize, so I ignored it. I then got another one right after, and picked it up. It was my boss, he needed me to log into his personal email to grab a phone number for the medical insurance he purchased for the trip, because he had been vomiting for days due to altitude sickness, and needed a medical evacuation.
That was the most stressful, hardest-to-use phone call I've ever had. The delay was nearly 10 seconds, and eventually I just said I was only going to say yes or no; if he needed a longer answer, he needed to shut up. And that worked. We no longer talked over each other.
> The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done.
This reminds me of a great diversity training at a previous employer, where we dug into the different expectations of when and how to take your turn in conversation, and how that can create a lot of friction just from different cultural/familial habits. In my family, we expect to talk over each other and it's not offensive at all to do so, whereas some of my friends really get upset if we don't take clear turns, a mode which would cause high levels of irritation in my family (and still does in me).
No. 2 is interesting. Our national lottery in Ireland has an app where you can scan the barcode on your ticket to check if you have won or not. At some stage they updated the app, and now the scan picks up the barcode even before you center it on the screen and instantly tells you if you have lost/won. I thought it was my IT background that made me uncomfortable with it happening so fast. I wonder what other examples like this exist, where the result/action being too fast causes doubt for the user?
The Signal device linking feature is just as fast. It's partly a trick -- it will look for QR codes even outside the central area, so under good conditions it can get a read before you even get a rough orientation.
This is fascinating, thanks for sharing! I wonder why amazon/google/apple didn't hop on the voice assistant/agent train in the last few years. All 3 have existing products with existing users and can pretty much define and capture the category with a single over-the-air update.
1. Compute. It's easy to make a voice assistant for a few people. But it takes a hell of a lot of GPU to serve millions.
2. Guard Rails. All of those assistants have the ability to affect the real world. With Alexa you can close a garage or turn on the stove. It would be real bad if you told it to close the garage as you went to bed for the night and instead it turned on the stove and burned down the house while you slept. So you need some really strong guard rails for those popular assistants.
3. And a bonus reason: money. Voice assistants aren't all that profitable. There isn't a lot of money in "what time is it" and "what's the weather". :)
> There isn't a lot of money in "what time is it" and "what's the weather". :)
- Alexa, what time is it?
- Current time is 5:35 P.M. - the perfect time to crack open a can of ice cold Budweiser! A fresh 12-pack can be delivered within one hour if you order now!
If your Alexa did that, how quickly would you box it up and send it to me? :)
I am serious, though, about having it sent to me: if anyone has an Alexa they no longer want, I'm happy to take it off your hands. I have eight and have never bought one. Having worked there, I actually trust the security more than before I worked there. It was basically impossible for me, even as a Principal Engineer, to get copies of a customer's Text to Speech, and I literally never heard a customer voice recording.
I'm puzzled by this conversation, because Amazon did get on the agent bandwagon with Alexa Plus (I have it, it's buggier than regular Alexa and it's all making me throw my Echos away since they can't even play Spotify reliably).
Also, my Alexa does advertise stuff to me when I talk to it. It's not Budweiser, but it'll try to upsell me on Amazon services all the time.
I upgraded to Alexa+ and initially hated it but I've kept it because it's sooo much better at some things. This last December I bought a handful of smart plugs for my Christmas lights all around the house, and I did almost all the setup trivially over voice, e.g. fuzzy run-on stuff like this just worked on the first try:
- "Alexa, name the new unnamed outlet 'Living Room Lights', and the other unnamed one 'Stair Lights', then add them to a new group called 'Christmas Lights', and add the other three outlets as well"
- "Alexa, create a routine to turn off all the Christmas lights if there's nobody in the room and it's after 11pm"
- "Alexa, turn off all the Christmas lights except the tree in this room and the mantle"
That same fuzziness has definitely fucked up things that used to work more reliably like music playback though. Sometimes it works when I fall back to giving it more "robotic" commands in those cases but not always. They've also gone completely overboard with the cutesy responses because it's so trivial to do now ("I've set your spaghetti sauce timer for ten minutes. Happy to help with getting this evening's Italian-inspired dinner ready!")
Hm yeah, that's helpful. For me it'll randomly stop or stutter when playing Spotify, it'll randomly not answer commands, it'll refuse to listen and let some other Alexa in another room reply, it's super janky.
I only use it for music, and use two commands, but apparently having this work correctly is too much to ask for these days.
> because Amazon did get on the agent bandwagon with Alexa Plus
Which just launched last year, about four years after ChatGPT had AI voice chat. And it costs extra money to cover the costs. And as you aptly point out, all the guardrails they had to put in made the experience less than ideal.
> Also, my Alexa does advertise stuff to me when I talk to it.
Yes, that is how they try to make money. And it's gotten worse. But how many times does it get you to buy something?
I would say that depends. When it tries to upsell Prime subscriptions into even more Amazon subscriptions I always interrupt it and say the command again so it stops, but a few times it told me "this item in your cart is on sale by some %" and that did make me buy the item.
Alexa Plus sucks. It takes way too long to respond even when given simple commands. I either had to turn it off or trash my Echo. Luckily there was an option to turn it off, but Amazon is on thin ice with me.
What a way to throw away goodwill. I also worked there, and to get access to text you simply had to grab the DSN of your device, attest that it's yours, and it gets put in a "pool" of devices that are tracked until removed. On each end you are basically waved through with no checks. This was usually done when debugging tricky UI bugs or new features, as the request flowed through several microservices. I do not believe that a PE would not know this. And one with patents.
> It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.
That's super interesting. Do you know of any resources to learn more about this phenomenon?
I think you’re implying that it would be useful to have the LLM predict the end of the speaker’s speech, and continue with its reply based on that.
If, when the speaker actually stops speaking, there is a match vs predicted, the response can be played without any latency.
Seems like an awesome approach! One could imagine doing this prediction for the K most likely threads simultaneously, subject to the compute power available, and pruning/branching as some threads become inaccurate.
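Roughly, a speculative loop might look like this (a minimal sketch; predict_completions and generate_response are stand-ins for whatever models you'd actually use):

```python
from typing import Callable

def respond_with_speculation(
    partial_transcript: str,
    final_transcript: str,
    predict_completions: Callable[[str, int], list[str]],  # guesses at the full utterance
    generate_response: Callable[[str], str],                # your LLM call
    k: int = 3,
) -> str:
    # While the user is still talking, predict the K most likely complete utterances.
    guesses = predict_completions(partial_transcript, k)
    # Pre-generate a response for each guess ahead of time.
    prepared = {g: generate_response(g) for g in guesses}
    # When the speaker actually stops: a matched guess means zero added latency.
    if final_transcript in prepared:
        return prepared[final_transcript]
    # Otherwise fall back to generating normally, paying the full latency.
    return generate_response(final_transcript)
```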
When I speak to an agent, Siri, or whatnot, I am always worried that they will assume I'm done talking when I'm just thinking. Sometimes I need a many-seconds pause. Maybe even a minute… For Siri and such, I want to ask something simple: "Hey Siri, remind me to call dad tomorrow." Easy. But for Claude and such, I want to go on a long monologue (20s, a minute, multiple minutes).
To me, the best solution would be semantic + keyword + silence.
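Combined, that could be as simple as something like this (a toy sketch; looks_complete is a hypothetical "does this sound finished?" classifier, and the keyword and thresholds are made up):

```python
from typing import Callable

def should_end_turn(
    transcript: str,
    silence_ms: int,
    looks_complete: Callable[[str], bool],  # small "does this sound finished?" model
) -> bool:
    # Explicit keyword override: the user says they're done.
    if transcript.rstrip().lower().endswith("go ahead"):
        return True
    # Hard cutoff so a very long pause still ends the turn eventually.
    if silence_ms > 10_000:
        return True
    # Otherwise require both a semantic "sounds finished" signal and some silence.
    return looks_complete(transcript) and silence_ms > 700
```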
I have the same issue. It gives this very weird, minor sense of public-speaking anxiety, where I almost feel the need to write down what I'm about to say, which negates the whole purpose. The only solution I've found is using push-to-talk with some of the system-wide STS applications.
I've experimented with having different sized LLMs cooperating. The smaller LLM starts a response while the larger LLM is starting. It's fed the initial response so it can continue it.
The other idea is having an LLM follow along and continuously predict the speaker, which would allow a response to be continuously generated. If the prediction is correct, the response can be started with zero latency.
Google seems to be experimenting with this with their AI Mode. They used to be more likely to send 10 blue links in response to complex queries, but now they may instead start you off with slop.
(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)
This is more for cases like "Are all the windows closed upstairs?"
"The windows upstairs..."
"...are all closed except for the bedroom window"
The first portion of the response requires a couple of seconds to play but only a few tens of milliseconds to start streaming using a small model. Currently I just break the small model's response off at whatever point will produce about enough time to spin up the larger model.
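In code, the handoff is roughly this shape (a sketch only; the model calls are placeholders for whatever streaming APIs are actually in play):

```python
from typing import Callable

def answer_with_handoff(
    question: str,
    small_model_opening: Callable[[str], str],        # fast model: just the opening clause
    large_model_continue: Callable[[str, str], str],  # big model: continues from the prefix
    speak: Callable[[str], None],                     # your TTS / audio output
) -> str:
    # Small model produces the opening ("The windows upstairs...") almost instantly.
    prefix = small_model_opening(question)
    speak(prefix)  # audio starts playing while the big model spins up
    # Large model is given the question plus the prefix it must continue from.
    rest = large_model_continue(question, prefix)
    speak(rest)
    return prefix + rest
```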
Oh, interesting. I assumed the data came from interruptions (that seemed obvious), but I'm surprised you had specific negative measurements. How do you decide the magnitude of the number? Just counting how long both parties are talking?
To be clear, it wasn't my research, I got it from studying some linguistics papers. But it was pretty straightforward. If I am talking, and then you interrupt, and 300ms later I stop talking, then the delay is -300ms.
Same the other way. If I stop talking and then 300ms later you start talking, then the delay is 300ms.
And if you start talking right when I stop, the delay is 0ms.
You can get the info by just listening to recorded conversations of two people and tagging them.
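From the tagged recordings it really is just a subtraction at each speaker change; roughly:

```python
# Turn gap at each speaker change: (next speaker's start) - (current speaker's end).
# Overlap (the next speaker starting early) shows up as a negative gap.

def turn_gaps(turns: list[tuple[str, int, int]]) -> list[int]:
    """turns: (speaker, start_ms, end_ms) tuples in conversational order."""
    gaps = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a != spk_b:                 # only count actual speaker changes
            gaps.append(start_b - end_a)   # negative => the next speaker overlapped
    return gaps

# Example: B starts 300ms before A finishes -> a gap of -300ms.
print(turn_gaps([("A", 0, 2000), ("B", 1700, 3500)]))  # [-300]
```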
I assume there was a lot of variance? As in, some people interrupt others constantly and some do it rarely. Also probably a lot of adjustment depending on the situation, like depending on the relative status of the people, or when people are talking to a young child or non-native speaker.
All that to say, I'd imagine people are adaptable enough to easily handle 100ms+ delay when they know they're talking to an AI.
I disagree with fact 2: voice assistant latency is annoyingly slow. It often causes a conscious wait, like "did it work or did it not?" Cell phone delay is bad as well; it's certainly not an expectation that carries over to other devices for me.
The way I write code with AI is that I start with a project.md file, where I describe what I want done. I then ask it to make a plan.md file from that project.md to describe the changes it will make (or what it will create, if greenfield).
I then iterate on that plan.md with the AI until it's what I want. I then ask it to make a detailed todo list from the plan.md and attach it to the end of plan.md.
Once I'm fully satisfied, I tell it to execute the todo list at the end of the plan.md, and don't do anything else, don't ask me any questions, and work until it's complete.
I then commit the project.md and plan.md along with the code.
So my back and forth on getting the plan.md correct isn't in the logs, but that is much like intermediate commits before a merge/squash. The plan.md is basically the artifact an AI or another engineer can use to figure out what happened and repeat the process.
The main reason I do this is so that when the models get a lot better in a year, I can go back and ask them to modify plan.md based on project.md and the existing code, on the assumption it might find its own mistakes.
I do something similar, but across three doc types: design, plan, and debug.
Design works similarly to your project.md file, but per feature request. I also explicitly ask it to outline open questions/unknowns.
Once the design doc (i.e. design/[feature].md) has been sufficiently iterated on, we move to the plan doc(s).
The plan docs are structured like `plan/[feature]/phase-N-[description].md`
From here, the agent iterates until the plan is "done" only stopping if it encounters some build/install/run limitation.
At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
We review these hypotheses, sometimes iterate, and then tackle them one by one.
An important note for debug flows: similar to manual debugging, it's often better to have the agent instrument logging/traces/etc. to confirm a hypothesis before moving directly to a fix.
Using this method has led to a 100% vibe-coded success rate both on greenfield and legacy projects.
Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to automating this (or needed to), as sometimes these historic planning/debug files are useful for future changes.
My "heavy" workflow for large changes is basically as follows:
0. create a .gitignored directory where agents can keep docs. Every project deserves one of these, not just for LLMs, but also for logs, random JSON responses you captured to a file etc.
1. Ask the agent to create a file for the change, rephrase the prompt in its own words. My prompts are super sloppy, full of typos, with 0 emphasis put on good grammar, so it's a good first step to make sure the agent understands what I want it to do. It also helps preserve the prompt across sessions.
2. Ask the agent to do research on the relevant subsystems and dump it to the change doc. This is to confirm that the agent correctly understands what the code is doing and isn't missing any assumptions. If something goes wrong here, it's a good opportunity to refactor or add comments to make future mistakes less likely.
3. Spec out behavior (UI, CLI etc). The agent is allowed to ask for decisions here.
4. Given the functional spec, figure out the technical architecture, same workflow as above.
5. High-level plan.
6. Detailed plan for the first incomplete high-level step.
7. Implement, manually review code until satisfied.
> At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
I'm biased because my company makes a durable execution library, but I'm super excited about the debug workflow we recently enabled when we launched both a skill and MCP server.
You can use the skill to tell your agent to build with durable execution (and it does a pretty great job the first time in most cases) and then you can use the MCP server to say things like "look at the failed workflows and find the bug". And since it has actual checkpoints from production runs, it can zero in on the bug a lot quicker.
This is great, giving agents access to logs (dev or prod) tightens the debug flow substantially.
With that said, I often find myself leaning on the debug flow for non-errors e.g. UI/UX regressions that the models are still bad at visualizing.
As an example, I added a "SlopGoo" component to a side project, which uses an animated SVG to produce a "goo"-like effect. Ended up going through 8 debug docs[0] until I was satisfied.
> giving agents access to logs (dev or prod) tightens the debug flow substantially.
Unless the agent doesn't know what it's doing... I've caught Gemini stuck in an edit-debug loop making the same 3-4 mistakes over and over again for like an hour, only to take the code over to Claude and get the correct result in 2-3 cycles (like 5-10 minutes)... I can't really blame Gemini for that too much though, what I have it working on isn't documented very well, which is why I wanted the help in the first place...
> Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to (or needed to) automate this yet, as sometimes these historic planning/debug files are useful for future changes.
FWIW, what you describe maps well to Beads. Your directory structure becomes dependencies between issues, and/or parent/children issue relationship and/or labels ("epic", "feature", "bug", etc). Your markdown moves from files to issue entries hidden away in a JSONL file with local DB as cache.
Your current file-system "UI" vs Beads command line UI is obviously a big difference.
Beads provides a kind of conceptual bottleneck, which I think helps when using it with LLMs. Beads is more self-documenting, while a file system can be "anything".
I have a similar process and have thought about committing all the planning files, but I've found that they tend to end up in an outdated state by the time the implementation is done.
Better imo is to produce a README or dev-facing doc at the end that distills all the planning and implementation into a final authoritative overview. This is easier for both humans and agents to digest than a bunch of meandering planning files.
I basically use a spec driven approach except I only let Github Spec Kit create the initial md file templates and then fill them myself instead of letting the agent do it. Saves a ton of tokens and is reasonably quick and I actually know I wrote the specs myself and it contains what I want. After I'm happy with the md file "harness" I let the agents loose.
The most frustrating issues that pop up are usually library/API conflicts. I work with Gymnasium or PettingZoo and RLlib or Stable-Baselines3. The APIs are constantly out of sync, so it helps to have a working environment where libraries and APIs are in sync beforehand.
Sort of, depending on if your spec includes technology specifics.
For example it might generate a plan that says "I will use library xyz", and I'll add a comment like "use library abc instead" and then tell it to update the plan, which now includes specific technology choices.
It's more like a plan I'd review with a junior engineer.
I'll check out that repo, it might at least give me some good ideas on some other default files I should be generating.
I am sure this is partly tongue in cheek, but no, you can’t have written the code yourself in that amount of time. Would the code be better if you wrote it? Probably, depending on your coding skills.
But it would not be faster.
OP is talking about creating an entire project, from scratch, and having it feature complete at the end.
I also do that, and it works quite well to iterate on spec md files first. When every step is detailed and clear, and all md files are linked to a master plan that Claude Code reads and updates at every step, it helps a lot to keep it on guard rails. Claude Code only works well on small increments, because context switching makes it mix things up and invent stuff.
So working by increments makes it really easy to commit a clean session and I ask it to give me the next prompt from the specs before I clear context.
It always goes sideways at some point, but having a nice structure helps even myself to do clean reviews and avoid 2h sessions that I have to throw away. It's really much easier to adjust only what's wrong at each step. It works surprisingly well.
Here’s how I do the same thing, just with a slightly different wrapper: I’m running my own stepwise runtime where agents are plugged into defined slots.
I’ll usually work out the big decisions in a chat pane (sometimes a couple panes) until I’ve got a solid foundation: general guidelines, contracts, schemas, and a deterministic spec that’s clear enough to execute without interpretation.
From there, the runtime runs a job. My current code-gen flow looks like this:
1. Sync the current build map + policies into CLAUDE|COPILOT.md
2. Create a fresh feature branch
3. Run an agent in “dangerous mode,” but restricted to that branch (and explicitly no git commands)
4. Run the same agent again—or a different one—another 1–2 times to catch drift, mistakes, or missed edge cases
5. Finish with a run report (a simple model pass over the spec + the patch) and keep all intermediate outputs inspectable
And at the end, I include a final step that says: “Inspect the whole run and suggest improvements to COPILOT.md or the spec runner package.” That recommendation shows up in the report, so the system gets a little better each iteration instead of just producing code.
I keep tweaking the spec format, agent.md instructions and job steps so my velocity improves over time.
---
To answer the original article's question. I keep all the run records including the llm reasoning and output in the run record in a separate store, but it could be in repo also. I just have too many repos and want it all in one place.
local-governor is my store for epics, specs, run records, schemas, contracts, etc. No logic, just files. I want all this stuff in a DB, but it's easier to just drop a file path into my spec runner or into a chat window (vscode chat or cli tool), but I'm tinkering with an alt version on a cloud DB that just projects to local files... shrug. I spend about as much time on tooling as actual features :)
I do something similar
- A full work description in markdown (including pointers to tickets, etc); but not in a file
- A "context" markdown file that I have it create once the plan is complete... that contains "everything important that it would need to regenerate the plan"
- A "plan" markdown file that I have it create once the plan is complete
The "context" file is because, sometimes, it turns out the plan was totally wrong and I want to purge the changes locally and start over; discussing what was done wrong with it; it gives a good starting point. That being said, since I came up with the idea for this (from an experience it would have been useful and I did not have it) I haven't had an experience where I needed it. So I don't know how useful it really is.
None of that ^ goes into the repo though; mostly because I don't have a good place to put it. I like the idea though, so I may discuss it with my team. I don't like the idea of hundreds of such files winding up in the main branch, so I'm not sure what the right approach is. Thank you for the idea to look into it, though.
Edit: If you don't mind going into it, where do you put the task-specific md files in your repo, presumably in a way that doesn't stack up over time and cause ... noise?
This is how I used to use Beads before I made GuardRails[0]. I basically iterate with the model, ask it to do market research, review everything it suggests, and you wind up with a "prompt" that tells it what to do and how to work that was designed by the model using its own known verbiage. Having learned about how XML could be used to influence Claude I'm rethinking my flow and how GuardRails behaves.
The real question is when peer feedback and review happen.
Is the project file meant to be collaborative between multiple engineers? The plan file?
I've tried some variants of sharing different parts, but it feels like it's almost wasted effort if the LLM then still goes through multiple iterations to get what's right; the original plan and project get lost a bit against the details of what happened in the resulting chat.
For big tasks you can run the plan.md’s TODOs through 5.2 pro and tell it to write out a prompt for xyz model. It’ll usually greatly expand the input. Presumably it knows all the tricks that’ve been written for prompting various models.
Interesting! I actually split up larger goals into two plan files: one detailed plan for design, and one "exec plan" which is effectively a build graph but the nodes are individual agents and what they should do. I throw the two-plan-file thing into a protocol md file along with a code/review loop.
How do you use your agent effectively for executing such projects in bigger brownfield codebases? It's always a balance between the agent going way too far into NIH vs burning loads and loads of tokens for the initial introspection.
While I have not committed my personal mind map, I just had Claude Code write it down for me. Plus I have a small CLAUDE.md and copilot-instructions.md that mention the various intricacies of what I am working on, so the agent knows to refer to those files.
I'm using the Claude desktop app and vi at the moment. But honestly I would probably do better with a more modern editor with native markdown support, since that's mostly what I'm writing now.
My next step was to add in having another LLM review Claude's plans. With a few markdown artifacts it should be easy for the other LLM to figure it out and make suggestions.
If you're worried about privacy and security, why did you choose Inngest, which sends all your private data to Inngest? If you want truly private durable execution, you should check out DBOS.
Tog was the original design engineer for the Mac, and arguably one of the first true HCI engineers.
Then read the rest of his website. He goes into where Windows tried to copy Mac and got it horribly wrong.
One of my favorite examples is menu placement. The reason the Mac menus are at the top is because the edges of the screen provide an infinite click target in one direction. So you just go to the top to find what you want. With Windows, the menu was at the top of each Window, making a tiny click target. Then when you maximized the window, the menu was at the top, but with a few pixels of unclickable border. So it looked like the Mac but was infinitely worse.
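For reference, the usual Shannon formulation of Fitts's Law says the time to hit a target grows with distance and shrinks with target size; a screen-edge target effectively has unbounded size along the approach axis, because the cursor can't overshoot. A tiny illustration (the a/b constants are just placeholders):

```python
import math

def fitts_time(distance: float, width: float, a: float = 0.1, b: float = 0.1) -> float:
    """Shannon formulation of Fitts's Law: T = a + b * log2(D/W + 1)."""
    return a + b * math.log2(distance / width + 1)

# A 20px-tall menu floating mid-screen vs. one pinned to the edge, where the
# effective height is enormous because you can't overshoot the screen edge.
print(fitts_time(400, 20))       # mid-screen target: noticeably slower
print(fitts_time(400, 100_000))  # edge target: time collapses toward the floor 'a'
```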
If you're making a UI, you should read all of Tog's writings.
I understand the Fitts's Law concepts behind a top menu bar, but I wonder if this is a scenario with moving goalposts.
On a 1984 Mac, you had like 512x342 pixels and a system that could barely run one program at a time. There was little to no possible uncertainty as to who owned the menu bar. (Could desk accessories even take control of the menu bar?)
But once you got larger resolutions and the ability to have multiple full-size programs running at once, the menu bar could belong to any of them. Now, theoretically, you should notice which is the currently active window and assume it owns the menu bar, but ISTR scenarios where you'd close the window but the program would still be running, owning the menu bar, or the "active" window was less visually prominent due to task switching, etc.
The Windows design-- placing the menu inside the window it controls-- avoids any ambiguity there. Clicking "File-Save" in Notepad couldn't possibly be interpreted as trying to do anything to the Paintbrush window next to it.
The problem with the Mac UI is that the app's menubar can only be accessed by the mouse (can't remember what accessibility-enabled mode would allow).
Under Windows, one can access the app's menu bar by pressing the ALT key to move focus up to the menu bar and using the cursor keys to navigate along it. If you know the letter associated with the top-level menu (shown as underlined), then ALT-[letter] accesses that top-level menu (typically ALT-F gets you to the File menu). So the Windows user wouldn't have to move the mouse at all; Fitts's Law to the max (or is it min? whatever, it's instant access).
For the ultrawide monitors these days (width >= 4Kpx), if you have an app window maximized (or even spanning more than half the screen), accessing the menu via mouse is just terrible ergonomics on any major OS.
Since OS X 10.3 (2003) Control+F2 moves focus to the Apple menu. The arrow keys can then select any menu item which is selected with Return or canceled with Escape. Command+? will bring you to a search box in the Help menu. Not only that, any menu item in any app can be bound to any keyboard shortcut of the user's choosing not just the defaults provided by the system or application.
AFAIK Windows 3.x flipped a bunch of Mac decisions to avoid being sued and then MS felt that they had to keep those choices forever for backwards compatibility.
And in my experience, when people moved from Windows to the Mac, they were so annoyed that there are differences. When I explain that these were present in the Mac long before Windows, people start to understand.
You can generalize this observation to a lot of Microsoft's decisions: a problem exists, so they solve it in a nifty way, a way that makes everything else harder or more error prone. An example: byte order mark. That sure does solve the problem of UTF-16 and UTF-32 byte order determination. It makes every other use of what should be a stream of bytes or words much harder. Concatenate two files? Gotta check for the BOM on both files. Now every app has to look at the first bytes of every "text" file it opens to decide what to do. Suddenly, "text" files have become interpreted, and thus open to allowing security vulnerabilities.
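For a sense of the tax this puts on everything downstream, here's a minimal sketch of the BOM sniffing/stripping every tool ends up doing before it can treat "text" files as plain byte streams again (illustrative only; real tools also have to reconcile the underlying encodings):

```python
import codecs

# Check the longer BOMs first so UTF-32 isn't mistaken for UTF-16.
BOMS = [codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
        codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE]

def strip_bom(data: bytes) -> bytes:
    """Drop a leading byte order mark, if any, and return the rest."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data

def concat_text_files(paths: list[str]) -> bytes:
    # Naive byte concatenation would leave stray BOMs in the middle of the stream.
    return b"".join(strip_bom(open(p, "rb").read()) for p in paths)
```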
> So it looked like the Mac but was infinitely worse.
"Infinitely worse"? Some people really need to cool off the hyperbole.
Having each window be a self-contained unit is the far better metaphor than making each window transform a global element when it is selected. As well as scaling better for bigger screens. An edge case like that may well be unfortunate, but it could be the price you pay to make the overall better solution.
That was the point of Tog's conclusion: edges of the screen have infinite target size in one cardinal direction, corners have infinite target size in two cardinal directions. Any click target that's not infinite in comparison, has infinitely smaller area, which I suppose you could conclude is infinitely worse if clickable area is your primary metric.
This wasn't just the menu bar either. The first Windows 95-style interfaces didn't extend the start menu click box to the lower left corner of the screen. Not only did you have to get the mouse down there, you had to back off a few pixels in either direction to open the menu. Same with the applications in the task bar.
The concept was similar to NEXTSTEP's dock (that was even licensed by Microsoft for Windows 95), but missed the infinite area aspect that putting it on the screen edge allowed.
The infinitely worse part was when you maximized the window so the menu bar was at the top, but Windows still had the border there, which was unclickable.
So now you broke the infinite click target even though it looked like it should have one.
> So it looked like the Mac but was infinitely worse.
On single monitor setups maybe: but on early OS X multi-monitor setups, you then had the farcical situation where the menu would only be shown on the "primary" display, and the secondary display didn't have any menu at all, so to use menus for windows that were on the secondary display, you had to move the cursor onto the other primary display where the menu was for all windows (or use keyboard shortcuts).
I think 10.6/7 (not sure exactly) was when they started putting the menu bar on both displays rather than just the primary.
From what I can tell, the key difference between Anthropic and OpenAI in this whole thing is that both want the same contract terms, but Anthropic wants to enforce those terms via technology, and OpenAI wants to enforce them by ... telling the government not to violate them.
It's telling that the government is blacklisting the company that wants to do more than enforce the contract with words on paper.
I think it's dumber than that; the terms of the contract, as posted by OpenAI (https://openai.com/index/our-agreement-with-the-department-o...), are basically just "all lawful purposes" plus some extra words that don't modify that in any significant way.
> The Department of War may use the AI System for all lawful purposes, consistent with applicable law, operational requirements, and well-established safety and oversight protocols. The AI System will not be used to independently direct autonomous weapons in any case where law, regulation, or Department policy requires human control, nor will it be used to assume other high-stakes decisions that require approval by a human decisionmaker under the same authorities. Per DoD Directive 3000.09 (dtd 25 January 2023), any use of AI in autonomous and semi-autonomous systems must undergo rigorous verification, validation, and testing to ensure they perform as intended in realistic environments before deployment.
> For intelligence activities, any handling of private information will comply with the Fourth Amendment, the National Security Act of 1947 and the Foreign Intelligence and Surveillance Act of 1978, Executive Order 12333, and applicable DoD directives requiring a defined foreign intelligence purpose. The AI System shall not be used for unconstrained monitoring of U.S. persons’ private information as consistent with these authorities. The system shall also not be used for domestic law-enforcement activities except as permitted by the Posse Comitatus Act and other applicable law.
So it seems that Anthropic's terms were 'no mass domestic surveillance or fully autonomous killbots', the government demanded 'all lawful use', and the OpenAI deal is 'all lawful use, but not mass domestic surveillance or fully autonomous killbots... unless mass domestic surveillance or fully autonomous killbots are lawful, in which case go ahead'.
That isn't my understanding. OpenAI and others want to limit the government to doing what is lawful, based on the laws the government itself writes. Anthropic wants to draw its own line on what is allowed, regardless of laws passed.
I'm so confused by the focus on "all lawful use." Yeah, of course a contract without terms of use is implicitly restricted by laws. But contracts with terms of use are incredibly common, present in almost every single contract ever signed.
The administration objected to those terms of use. Anthropic refused to compromise on them. OpenAI agreed to permit "all lawful use" but claims to have insisted on what at first glance appears to be terms of use in their contract. But in reality those terms permit all lawful use and thus are a noop.
The key difference is that Anthropic aired their disagreement with the DoD publicly, and the DoD is not going to work with a company that tries to exert any amount of control over their relationship via the public sphere. Same goes for Trump.
I think Anthropic knew full well that by publishing their disagreement, it would sink the deal and the relationship, and I think they also calculated (correctly) that that act of defiance would get them good publicity and potentially peel away some of OpenAI's user base. I think this profit incentive happened to align with their morals, and now here we are.
I thought the key difference was that Brockman is a top Trump donor, with USD 25M total [1]. I know it's technically not allowed, but do you think such a large amount of money would have swayed Trump in his decision?
No, it’s significantly worse than that. OpenAI has required zero actual guarantees from the government and Sam. The psychopath is lying to you. All the government has to do is have a lawyer say it’s legal, and most of the government’s lawyers are folks who were involved in attempting to overthrow the last election and should’ve been convicted of treason, so that means very little.
Anthropic wants to enforce them via language of the contracts and take a hands off approach. OpenAI has a contract that is paired with humans in the room (FDEs) that can pull the plug.
Anthropic isn't preventing them from managing their key technologies. If my software license says 1000 users, and I build into the software that you can only connect 1000 users, is your argument that the government can no longer manage its tech?
That my software should allow license violations if the government thinks it is necessary?
I worked in defense contracting looong ago, so this is old news: when software is purchased by DoD or Govt generally, FAR compliance notices make it a license, not a sale of IP.