Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
In my case it doesn't seem to be related to system load. I have an issue where (mainly) using Firefox can trigger random system freezes on Linux, often with the browser going down first. But running CPU/memory stress tests, compiling things, etc. doesn't cause any errors, and the cooler is downright bored.
I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
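A minimal sketch of what such a per-frame check could look like (everything here — function name, constants, loop shape — is invented for illustration; the actual Guild Wars code is not public):

```python
import struct

def hardware_sanity_check(iterations=10_000):
    """Run a deterministic, math-heavy loop over freshly allocated memory
    and hash the results. On healthy hardware the hash never changes; a
    bit flip in the CPU, cache, or RAM yields a different value."""
    acc = 1.0
    h = 0
    buf = bytearray(64 * 1024)  # touch some freshly allocated memory
    for i in range(iterations):
        acc = (acc * 1.000001 + 0.5) % 1024.0
        # fold the float's exact bit pattern into an integer hash
        bits = struct.unpack('<Q', struct.pack('<d', acc))[0]
        h = (h * 31 + bits) & 0xFFFFFFFFFFFFFFFF
        buf[i % len(buf)] ^= bits & 0xFF
    return (h * 31 + sum(buf)) & 0xFFFFFFFFFFFFFFFF
```

In a scheme like the one described, the known-good value is computed once on trusted hardware and shipped with the game; each client compares against it and any mismatch gets saved and attached to bug reports.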
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
As a mobile dev at YouTube I'd periodically scroll through crash reports associated with code I owned and the long tail/non-clustered stuff usually just made absolutely no sense and I always assumed at least some of it was random bit flips, dodgy hardware, etc.
GW1 was my childhood. The MMO with no monthly fees appealed to my Mom, and I met friends for years. The 8-skill build system was genius, as were the cut scenes featuring your player character. If there's ever a 3rd game I would love to see something allowing for more expression through build creation, though I could see how that's hard to balance.
I still remember summoning flesh golems as a necromancer! Too much of my life sunk into GW1. Beat all 4(?) expansions. Logged in years later after I finally put it down to find someone had guessed my weak password, stole everything, then deleted all my characters. C'est la vie.
Thanks go to ASRock for making motherboards that support ECC memory with AMD's Threadripper 1950X; that's what I learned to overclock on.
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn't overclock ECC memory to be a bit confused. It's the only memory you should be overclocking, because it's the only memory with which you can prove you aren't getting errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5 is that it’s much harder to achieve true stability. Often even CPU mounting pressure being too high or too low can result in intermittent issues and errors. I would never overclock non-ECC DDR5; I could never trust it, and the headroom available is far less than in previous generations. It’s also much more sensitive to heat: it can start having trouble between 50 and 60 degrees C, and basically needs dedicated airflow when overclocking. Note, I am not talking about the on-die ECC; that’s important, but different in practice from full-fat classic ECC with an extra chip.
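For anyone unfamiliar with how that full-fat ECC actually corrects errors: DIMMs use a single-error-correct code over 64 data bits. A toy Hamming(7,4) version of the same idea (illustrative only, far narrower than the real SECDED code on a DIMM):

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword (parity bits at
    positions 1, 2, 4 of the classic Hamming layout)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # data bits d1..d4
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(code):
    """Compute the syndrome, fix a single flipped bit if present,
    and return the decoded 4-bit value."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:                # nonzero syndrome points at the bad bit
        c[syndrome - 1] ^= 1
    d = [c[2], c[4], c[5], c[6]]
    return d[0] | (d[1] << 1) | (d[2] << 2) | (d[3] << 3)
```

Flipping any single bit of the codeword still decodes to the original value, which is exactly why a corrected-error counter can reveal marginal timings that crash-free stress tests miss.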
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
DDR4 and 5 both have similar heat sensitivity curves which call for increased refresh timings past 45C.
Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.
On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.
I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.
If you look around you'll see people already putting the new, Chinese-made DDR4 through its paces; it's holding up far better than anyone expected.
Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.
> From then on I considered people who think you shouldn’t overclock ECC memory to be a bit confused. It’s the only memory you should be overclocking, because it’s the only memory you can prove you don’t have errors.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
>[0] In practice, if they didn't, they'd all just flock to AMD.
only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers who went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower. Back in the Zen 2/3 days the choice was basically DDR4-3600 without ECC or DDR4-2400 with ECC.
At the beginning of your comment I was wondering if the "attitude" that was corporate serving was the anti-ECC stance or the pro-ECC stance (based on the full chunk that you quoted). I'm glad that by the end of the comment you were clearly pro ECC.
Any workstation where you are getting serious work done should use ECC
As a community alpha tester of GW1, this was a fun read! Such an educational journey and what a well organized and fruitful one too. We could see the game taking shape before our eyes! As a European, I 100% relied on being young and single with those American time zones. :D Tests could end in my group at like 3 am, lol.
Every interesting bug report I've read about Guild Wars is Dwarf Fortress tier. A very hardcore, longtime player who was recounting some of the better ones to me shared a most excellent one about spirits or ghosts: some sort of player-summoned thing that stuck around endlessly and caused OOM errors?
There's a famous Raymond Chen post about how a non-trivial percentage of the blue screen of death reports they were getting appeared to be caused by overclocking, sometimes from users who didn't realize they had been ripped off by the person who sold them the computer: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35.... Must've been really frustrating.
This was a design choice by AMD at the time for their Athlon Slot A CPUs. They used the same Slot A board, on which you could set the CPU speed by bridging connections. Since the Slot A CPU came in a package, you couldn't see the actual CPU etching. So shady sellers would pull the cover off a high-speed CPU and put it on a slow one that they'd overclocked to unstable levels.
Some multiplayer real-time strategy (RTS) games used deterministic fixed-point maths and incremental updates to keep the players in sync. Despite this, there would be the occasional random de-sync kicking someone out of a game, more than likely because of bit flips.
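For context, "deterministic fixed-point maths" means doing all simulation arithmetic in integers so every machine computes bit-identical results (floating point can differ across compilers and CPUs). A minimal 16.16 fixed-point sketch, with names that are illustrative rather than from any particular engine:

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in 16.16 fixed point

def to_fixed(x):
    """Convert a float to 16.16 fixed point (done once, at data-load time)."""
    return int(round(x * ONE))

def fx_mul(a, b):
    """Multiply two fixed-point values: pure integer math, so it is
    bit-identical on every client in a lockstep game."""
    return (a * b) >> FRAC_BITS

def fx_div(a, b):
    """Divide two fixed-point values, pre-shifting to keep precision."""
    return (a << FRAC_BITS) // b
```

Each client then typically hashes its game state every few ticks and compares hashes with its peers; a mismatch is a desync, and as the parent notes, with fully deterministic math the remaining mismatches point at hardware (or a rare nondeterminism bug) rather than the simulation itself.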
I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.
The Turion 64 was the worst CPU I've ever bought. Even 10-year-old games had rendering artefacts all over the place: triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such weird behavior, because it always happened around 10 minutes after I started playing. It didn't matter _what_ I was playing; every game had rendering artefacts, one way or the other.
The most obvious ones were 3D games like CS 1.6, Guild Wars, NFSU(2), and C&C Generals (though Generals ran better/longer for whatever reason).
The funny part about the VRAM(?) bit flips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always at the same z-distance from the camera, because game engines presorted everything before uploading/executing the actual GL calls.
After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.
Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?
I imagine the largest volume of game memory consumption is media assets, which if corrupted wouldn't really matter, and the storage overhead of duplicating the genuinely important content would be reasonably negligible?
I think the most reasonable take would be to just tell users their hardware is borked, they're going to have a bad time outside the game too, and point them to one of the many guides around this topic.
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
I put engineering effort into handling bad hardware all the time, because safety-critical :)
It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.
80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.
I think I sit in another camp. A lot of my engineering efforts are in working around bad hardware.
Better the user sees some lag due to state rebuild versus a crash.
Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.
That's an interesting idea. How might you implement that? Like RAID but on the level of variables? Maybe the one valid use case for getters/setters? :)
As another user fairly pointed out: ECC. But a compiler-level flag could probably achieve the redundancy; sourcing stuff from disk would probably still need to happen twice to ensure that bit flips haven't occurred, etc.
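A sketch of the getter idea, using nothing beyond the stdlib (the class and its behaviour are invented for illustration, not from anything in this thread):

```python
import zlib

class RedundantValue:
    """Store a value twice plus a checksum; a CRC mismatch on read
    signals a likely bit flip, and the good copy heals the bad one."""
    def __init__(self, value: bytes):
        self._a = bytes(value)
        self._b = bytes(value)
        self._crc = zlib.crc32(value)

    def get(self) -> bytes:
        if zlib.crc32(self._a) == self._crc:
            return self._a
        if zlib.crc32(self._b) == self._crc:
            self._a = self._b        # primary corrupted: heal it from the copy
            return self._b
        raise RuntimeError("both copies corrupted")
```

A real version would also scrub periodically rather than only checking on read, since a flip in a rarely read value can otherwise sit undetected until both copies have rotted.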