teraflop's comments | Hacker News

Yes, and it's not just about application-level integrity. The WAL operates at the page level, so dropping one WAL entry and then applying later ones would likely cause corruption at the B-tree level.

For instance, say you have a node A which has a child B:

* Transaction 1 wants to add a value to B, but it's already full, so B is split into new nodes C and D. Correspondingly, the pointer in A that points to B is removed, and replaced with pointers to C and D.

* Transaction 2 makes an unrelated change to A.

If you skip the updates from transaction 1, and apply the updates from transaction 2, then suddenly A's data is overwritten with a new version that points to nodes C and D, but those nodes haven't been written. The pointers just point to uninitialized garbage.
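To make that concrete, here's a toy sketch in Python (a plain dict standing in for pages, with made-up contents; nothing like SQLite's real page format) of what replaying transaction 2 while dropping transaction 1 does:

    # Toy model: pages are a dict keyed by "page name"; WAL frames are
    # (page, new_content) pairs. All names and contents here are invented.
    pages = {
        "A": {"children": ["B"]},
        "B": {"keys": [1, 2, 3]},              # B is full
    }

    txn1 = [                                   # split B into C and D, repoint A
        ("C", {"keys": [1, 2]}),
        ("D", {"keys": [3, 4]}),
        ("A", {"children": ["C", "D"]}),
    ]
    txn2 = [                                   # later, unrelated change to A
        ("A", {"children": ["C", "D"], "unrelated_flag": True}),
    ]

    # "Recovery" that drops txn1's frames but replays txn2's:
    for page, content in txn2:
        pages[page] = content

    for child in pages["A"]["children"]:
        print(child, "->", pages.get(child, "missing: dangling pointer"))
    # C -> missing: dangling pointer
    # D -> missing: dangling pointer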


> The checksums in WAL are likely not meant to check for random page corruption in the middle; maybe they’re just to check if the last write of a frame was fsynced properly or not?

This is the correct explanation. The purpose is to detect partial writes, not to detect arbitrary data corruption. If detecting corruption was the goal, then checksumming the WAL without also checksumming the database itself would be fairly pointless.

In fact, it's not accurate to say "SQLite does not do checksums by default, but it has checksums in WAL mode." SQLite always uses checksums for its journal, regardless of whether that's a rollback journal or a write-ahead log. [1]

For the purpose of tolerating and recovering from crashes/power failures, writes to the database file itself are effectively idempotent. It doesn't matter if only a subset of the DB writes are persisted before a crash, and you don't need to know which ones succeeded, because you can just roll all of them forward or backward (depending on the mode). But for the journal itself, distinguishing partial journal entries from complete ones matters.

No matter what order the disk physically writes out pages, the instant when the checksum matches the data is the instant at which the transaction can be unambiguously said to commit.

[1]: https://www.sqlite.org/fileformat.html


Exactly. To put it another way:

Imagine the power goes out while sqlite is in the middle of writing a transaction to the WAL (before the write has been confirmed to the application). What do you want to happen when power comes back, and you reload the database?

If the transaction was fully written, then you'd probably like to keep it. But if it was not complete, you want to roll it back.

How does sqlite know if the transaction was complete? It needs to see two things:

1. The transaction ends with a commit frame, indicating the application did in fact perform a `COMMIT TRANSACTION`.

2. All the checksums are correct, indicating the data was fully synced to disk when it was committed.

If the checksums are wrong, the assumption is that the transaction wasn't fully written out. Therefore, it should be rolled back. That's exactly what sqlite does.
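Roughly, the recovery pass looks like this (a sketch only, with an invented frame layout and a plain CRC32; the real WAL uses its own 64-bit checksum seeded from a salt in the header):

    import zlib

    def recover(frames):
        """frames: list of (page_data, is_commit, stored_checksum).
        Keep frames only up to the last commit frame whose cumulative
        checksum verifies; everything after that is treated as a partial
        write and rolled back."""
        running, keep = 0, 0
        for i, (page_data, is_commit, stored) in enumerate(frames, 1):
            running = zlib.crc32(page_data, running)   # checksum chains across frames
            if running != stored:
                break            # torn/unsynced write: stop trusting the log here
            if is_commit:
                keep = i         # a complete, verified transaction ends here
        return keep              # frames beyond `keep` are ignored

The important bit is that a torn transaction fails check 2 even if its commit frame happens to have reached the disk first.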

This is not "data loss", because the transaction was not ever fully committed. The power failure happened before the commit was confirmed to the application, so there's no way anyone should have expected that the transaction is durable.

The checksum is NOT intended to detect when the data was corrupted by some other means, like damage to the disk or a buggy app overwriting bytes. Myriad other mechanisms should be protecting against those already, and sqlite is assuming those other mechanisms are working, because if not, there's very little sqlite can do about it.


Why is the commit frame not sufficient to determine whether the transaction was fully written or not? Is there a scenario where the commit frame is fsynced to disk but the preceding data isn't?

The disk controller may decide to write out blocks in a different order than the logical layout in the log file itself, and be interrupted before completing this work.

It’s worth noting this is also dependent on filesystem behavior; most that do copy-on-write will not suffer from this issue regardless of drive behavior, even if they don’t do their own checksumming.

We still have the elevator algorithm on NVMe?

NVMe drives do their own manipulation of the datastream. Wear leveling, GC, trying to avoid rewriting an entire block for your 1 bit change, etc. NVMe drives have CPUs and RAM for this purpose; they are full computers with a little bit of flash memory attached. And no, of course they're not open source even though they have full access to your system.

Anything that uses NAND storage technology is going to be optimized in some way like this. NVMe is just the messenger.

SQLite runs on anything from servers to Internet-connected lightbulbs.

Which lightbulbs include SQLite? I kind of want one.

these guys have a Cree logo on their homepage so maybe Cree?

https://imaginovation.net/case-study/cree/

At least what I could turn up with a quick web search.


For instance, running on ZFS or one of its peers.

ZFS isn't viable for SQLite unless you turn off fsyncs in ZFS, because otherwise you will have the same experience I had for years: SQLite may randomly hang for up to a few minutes with no visible cause, if there aren't enough pending writes to fill up a txg in the background. If your app depends on SQLite, it'll randomly die.

Btrfs is a better choice for sqlite, haven’t seen that issue there.


Interesting. Found a GitHub issue that covers this bug: https://github.com/openzfs/zfs/issues/14290

The latest comment seems to be a nice summary of the root cause, with earlier comments in the thread pointing to ftruncate (rather than fsync) as a trigger:

>amotin

>I see. So ZFS tries to drop some data from pagecache, but there seems to be some dirty pages, which are held by ZFS till them either written into ZIL, or to disk at the end of TXG. And if those dirty page writes were asynchronous, it seems there is nothing that would nudge ZFS to actually do something about it earlier than zfs_txg_timeout. Somewhat similar problem was recently spotted on FreeBSD after #17445, which is why newer version of the code in #17533 does not keep references on asynchronously written pages.

Might be worth testing zfs_txg_timeout=1 or 0


> ZFS isn’t viable for SQLite unless you turn off fsync’s in ZFS

Which you can do on a per dataset ('directory') basis very easily:

    zfs set sync=disabled mydata/mydb001
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...

Meanwhile all the rest of your pools / datasets can keep the default POSIX behaviour.


You know what's even easier than doing that? Neglecting to do it or meaning to do it then getting pulled in to some meeting (or other important distraction) and then imagining you did it.

> Neglecting to do it or meaning to do it then getting pulled in to some meeting (or other important distraction) and then imagining you did it.

If your job is to make sure your file system and your database—SQLite, Pg, My/MariaDB, etc.—are tuned together, and you don't tune it, then you should be called into a meeting. Or at least the no-fault RCA should bring up remediation methods to make sure it's part of the SOP so that it won't happen again.

The alternative the GP suggests is using Btrfs, which I find even more irresponsible than your non-tuning situation. (Heck, if someone on my sysadmin team suggested we start using Btrfs for anything I would think they were going senile.)


Facebook is apparently using it at scale, which surprised me. Though that's not necessarily an endorsement, and who knows what their kernel patchset looks like.

Doesn't turning off sync mean you can lose confirmed writes in a power failure?

Apropos this use case, ZFS is usually not recommended for databases. Competent database storage engines have their own strong corruption detection mechanisms regardless. What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.

But what ZFS provides isn't weaker, and in SQLite page checksums are opt-in: https://www.sqlite.org/cksumvfs.html

EDIT: It seems they're opt-in for PostgreSQL, too: https://www.postgresql.org/docs/current/checksums.html


you might like my other post - https://avi.im/blag/2024/databases-checksum/

bad news is, most databases don't do checksums by default.


This is in fact good news.

Redundantly performing the same performance-intensive tasks on multiple layers makes latency less predictable and just generally wastes resources.


Actually bad news. Most popular filesystems and filesystem configurations have limited and/or weak checksums, certainly much worse than you'd want for a database. 16-bit and 32-bit CRCs are common in filesystems.

This is a major reason databases implement their own checksums. Unfortunately, many open source databases have weak or non-existent checksums too. It is sort of an indefensible oversight.


So when checksums are enabled and the DB process restarts or the host reboots does the DB run the checksum over all the stored data? Sounds like it would take forever for the database to come online. But if it doesn't it may not detect bitrot in time...?

Hm.

On the other hand, I've heard people recommend running Postgres on ZFS so you can enable on-the-fly compression. This increases CPU utilization on the Postgres server by quite a bit and read latency of uncached data a bit, but it decreases the necessary write IOPS a lot. And as long as the compression is happening a lot in parallel (which it should be, if your database has many parallel queries), it's much easier to throw more compute threads at it than to speed up the write speed of a drive.

And after a certain size, you start to need atomic filesystem snapshots to be able to get a backup of a very large and busy database without everything exploding. We already see the more efficient backup strategies from replicas struggle on some systems, and we're at our wits' end as to how to create proper backups and archives without reducing the backup frequency to weeks. ZFS has mature mechanisms and zfs-send to move this data around with limited impact on the production dataflow.
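For reference, the knobs involved are one-liners (pool, dataset, and snapshot names here are made up; since a ZFS snapshot is atomic, Postgres can recover from it the same way it recovers from a crash, provided data and WAL live in the same snapshot):

    zfs set compression=lz4 tank/pgdata
    zfs snapshot tank/pgdata@2025-08-01
    zfs send -i @2025-07-31 tank/pgdata@2025-08-01 | ssh backuphost zfs receive backup/pgdata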


Is an incremental backup of the database not possible? Pgbackrest etc. can do this by creating a full backup followed by incremental backups from the WAL.
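Something like this (the stanza name is made up); the incrementals are file-level, with WAL archiving running alongside for point-in-time recovery:

    pgbackrest --stanza=main --type=full backup
    pgbackrest --stanza=main --type=incr backup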

For Postgres specifically you may also want to look at using hot_standby_feedback, as described in this recent HN article: https://news.ycombinator.com/item?id=44633933


On the big product clusters, we have incremental pgbackrest backups running for 20 minutes. Full backups take something between 12 - 16 hours. All of this from a sync standby managed by patroni. Archiving all of that takes 8 - 12 hours. It's a couple of terabytes of noncompressible data that needs to move. It's fine though, because this is an append-log-style dataset and we can take our time backing this up.

We also have decently sized clusters with very active data on them, and rather spicy recovery targets. On some of them, a full backup from the sync standby takes 4 hours, we need to pull an incremental backup at most 2 hours afterwards, but the long-term archiving process needs 2-3 hours to move the full backup to the archive. This is the first point at which filesystem snapshots (admittedly of the pgbackrest repo) become necessary to adhere to SLOs and keep the system functioning.

We do all of the high-complexity, high-throughput things recommended by postgres, and it's barely enough on the big systems. These things are getting to the point of needing a lot more storage and network bandwidth.


This was my understanding as well, color me also confused.

No, competent systems just need to have something that, taken together, prevents data corruption.

One possible instance of that is a database providing its own data checksumming, but another perfectly valid one is running a database that doesn't on top of a lower layer with a sufficiently low data corruption rate.


> ZFS is usually not recommended for databases

Say more? I've heard people say that ZFS is somewhat slower than, say, ext4, but I've personally had zero issues running postgres on zfs, nor have I heard any well-reasoned reasons not to.

> What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.

Sorry, what? Just yesterday matrix.org had a post about how they (using ext4 + postgres) had disk corruption which led to postgres returning garbage data: https://matrix.org/blog/2025/07/postgres-corruption-postmort...

The corruption was likely present for months or years, and postgres didn't notice.

ZFS, on the other hand, would have noticed during a weekly scrub and complained loudly, letting you know a disk had an error, letting you attempt to repair it if you used RAID, etc.
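E.g. (pool name is hypothetical):

    zpool scrub tank
    zpool status -v tank    # shows checksum errors and, if unrecoverable, the affected files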

It's stuff like in that post that are exactly why I run postgres on ZFS.

If you've got specifics about what you mean by "databases should bring their own implementation", I'd be happy to hear it, but I'm having trouble thinking of any sorta technically sound reason for "databases actually prefer it if filesystems can silently corrupt data lol" being true.


SQLite on ZFS needs the Fsync behaviour to be off, otherwise SQLite will randomly hang the application as the fsync will wait for the txg to commit. This can take a minute or two, in my experience.

Btrfs is a better choice for SQLite.


Btw this concern also applies to other databases, although probably it manifests in the worst way in SQLite. Essentially, you’re doing a WAL over the file systems’ own WAL-like recovery mechanism.

> SQLite on ZFS needs the Fsync behaviour to be off […]

    zfs set sync=disabled mydata/mydb001
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...

The point is that a database cannot rely on being deployed on a filesystem with proper checksums.

Ext4 uses 16-/32-bit CRCs, which is very weak for storage integrity in 2025. Many popular filesystems for databases are similarly weak. Even if they have a strong option, the strong option is not enabled by default. In real-world Linux environments, the assumption that the filesystem has weak checksums is usually true.

Postgres has (IIRC) 32-bit CRCs but they are not enabled by default. That is also much weaker than you would expect from a modern database. Open source databases do not have a good track record of providing robust corruption detection generally nor the filesystems they often run on. It is a systemic problem.
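For anyone who wants them on Postgres, they're easy to turn on (the data directory path is just an example):

    initdb --data-checksums -D /var/lib/postgresql/data    # at cluster creation time
    pg_checksums --enable -D /var/lib/postgresql/data      # existing cluster, PostgreSQL 12+, server stopped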

ZFS doesn't support features that high-performance database kernels use and is slow, particularly on high-performance storage. Postgres does not use any of those features, so it matters less if that is your database. XFS has traditionally been the preferred filesystem for databases on Linux and Ext4 will work. Increasingly, databases don't use external filesystems at all.


I know MySQL has checksums by default, how does it compare? Is it useful or is it similarly weak?

I don't know but LLMs seem to think it uses a 32-bit CRC like e.g. Postgres.

In fairness, 32-bit CRCs were the standard 20+ years ago. That is why all the old software uses them and CPUs have hardware support for computing them. It is a legacy thing that just isn't a great choice in 2025.


It's not great for databases that do updates in place. Log-structured merge databases (which most newer DB engines are) work fine with its copy-on-write semantics.

I believe it's also because of this (from https://www.sqlite.org/wal.html):

> [...] The checkpoint does not normally truncate the WAL file (unless the journal_size_limit pragma is set). Instead, it merely causes SQLite to start overwriting the WAL file from the beginning. This is done because it is normally faster to overwrite an existing file than to append.

Without the checksum, a new WAL entry might cleanly overwrite an existing longer one in a way that still looks valid (e.g. "A|B" -> "C|B" instead of "AB" -> "C|data corruption"), at least without doing an (expensive) scheme of overwriting B with invalid data, fsyncing, and then overwriting A with C and fsyncing again.

In other words, the checksum allows an optimized write path with fewer expensive fsync/truncate operations; it's not a sudden expression of mistrust of lower layers that doesn't exist in the non-WAL path.
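A toy version of that chained-checksum trick, using a plain CRC32 instead of SQLite's actual 64-bit WAL checksum (the salt-plus-chaining idea is what matters):

    import zlib

    def with_checksums(frames, salt):
        out, running = [], salt
        for data in frames:
            running = zlib.crc32(data, running)    # cumulative across frames
            out.append((data, running))
        return out

    old_wal = with_checksums([b"A", b"B"], salt=1)
    new_wal = with_checksums([b"C"], salt=2)       # salt changes when the WAL restarts

    # Overwrite from the beginning without truncating: the new frame C is
    # followed on disk by the stale frame B from the previous generation.
    on_disk = new_wal + old_wal[1:]

    running = 2                                    # reader uses the new salt
    for data, stored in on_disk:
        running = zlib.crc32(data, running)
        print(data, "valid" if running == stored else "stale, ignored")
    # b'C' valid
    # b'B' stale, ignored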


According to the TNG Technical Manual (which is not entirely canon, but whatever) it's all handwaved by a context-sensitive AI that figures out who you're trying to talk to from context clues.

https://scifi.stackexchange.com/questions/55156/why-is-there...


> "The retrograde orbital plane... of 3I/ATLAS around the Sun lies within 5 degrees of that of Earth... The likelihood for that coincidence out of all random orientations is 0.2%." Not sure where he comes up with 0.2%.

This part of the calculation, at least, is basically correct. The orientation of a plane in space is defined by its normal vector, so the right way to look at probabilities is in terms of solid angle. The normal of 3I/ATLAS's orbit falls within a cone around Earth's normal vector, having a half-angle of 5 degrees, and that cone's solid angle occupies about 0.2% of the full sphere.

Of course, this is only the chance of a retrograde alignment. Presumably, if the comet's orbit was prograde aligned with the Earth's to within 5 degrees, Loeb would be making exactly the same claim. So really, the relevant probability is 0.4%.
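Quick sanity check of the numbers: the solid angle of a cone with half-angle theta is 2*pi*(1 - cos theta), so its fraction of the full sphere is (1 - cos theta)/2.

    from math import cos, radians

    f = (1 - cos(radians(5))) / 2   # fraction of the sphere within 5 degrees of a given axis
    print(f"{f:.2%}")               # ~0.19%, i.e. the quoted "0.2%"
    print(f"{2 * f:.2%}")           # ~0.38%, once prograde alignments are counted too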

Nevertheless, I agree that the article is basically just a bunch of cherry-picked probabilities and insinuations that don't add up to much.

Also:

> "the brightness of 3I/ATLAS implies an object that is ~20 kilometers in diameter (for a typical albedo of ~5%), too large for an interstellar asteroid."

But to justify this, Loeb cites his own work showing that the object is either a large asteroid, or a comet with a small nucleus. And then he seems to have looked at some earlier spectra and jumped to the conclusion that 3I/ATLAS couldn't be a comet, so it must be a large asteroid. But of course, follow-up observations have debunked this point and clearly shown it to be a comet.


I think there's also a sampling bias here? ATLAS, the survey that discovered the comet, is specifically looking for potential Earth impactors. One assumes that would involve looking close to Earth's own orbital plane.

I'm no Hubble expert, but a bit of research turned up the "HST Primer" [1] which is apparently up-to-date for the current observing cycle, and which says:

> HST is capable of tracking moving targets with the same precision achieved for fixed targets. This is accomplished by maintaining FGS Fine Lock on guide stars and driving the FGS star sensors in the appropriate path, thus moving the telescope to track the target. Tracking under FGS control is technically possible for apparent target motions up to 5 arcsec/s.

According to JPL Horizons, the current angular motion of 3I/ATLAS across the sky is <0.03 arcsec/s, so it's well within Hubble's capabilities.

My understanding is that Hubble's one-gyro mode mainly complicates the process of quickly moving from one target to another. Once the telescope is pointed at a target, stabilization and tracking are done using guide stars, without relying on gyros.

Anyway, in absolute terms, 3I/ATLAS isn't moving that fast. Its orbital speed is about 3x that of Mars, but it's farther away, and (for now) much of that motion is directed inward towards the sun.

[1]: https://hst-docs.stsci.edu/hsp/the-hubble-space-telescope-pr...


You're on the money. Hubble isn't too perturbed when it can lock on to guide stars, but some observations are no longer possible in one-gyro mode.

If you recognize that sign languages such as ASL are distinct languages, as linguists do, then it naturally makes sense that native speakers of those languages would want a way to write them down in a static symbolic way, for all the same practical reasons that we use the Latin alphabet in English.

For instance, being able to quickly scan through a piece of text instead of having to watch it play in video form, or being able to search and index it, or providing a way to organize dictionaries.

There's no inherent problem with using the same notation scheme for different sign languages, just like we use essentially the same alphabet for English, Spanish, French, German, etc.


And as soon as all of the major players are using the same algorithmic system to set their prices, they are effectively a cartel and any prospect of real competition disappears. It's exactly the same situation we're seeing with RealPage in the real estate industry.

And if anyone objects or raises any claim they'll just shrug and say it's those darn algorithms again.

https://www.hklaw.com/en/insights/publications/2025/01/the-l...

The government is taking the RealPage issue seriously.


If this theory were correct, why are margins on flying sub-5%?

This kind of collusion works very well for supply-constrained apartment rentals with high switching costs, but works less well for commodities.


But if you drop the assumption that each site only writes rows prefixed with its site ID, then you're right back to the original situation where writes can be silently overwritten.

Do you consider that acceptable, or don't you?


Not silently overwritten: the collision is visible to the application layer once connectivity is restored and you can prompt humans to reconcile it if need be.

The Java bytecode instruction set actually has a quite complicated specification: https://docs.oracle.com/javase/specs/jvms/se8/html/

Which means implementations also have to be correspondingly complicated. You have to handle quite a few different primitive data types each with their own opcodes, class hierarchies, method resolution (including overloading), a "constant pool" per class, garbage collection, exception handling, ...

I would expect a minimal JVM that can actually run real code generated by a Java compiler to require at least 10x as much code as a minimal Bedrock VM, and probably closer to 100x.


Neat. I wonder if this was an inspiration for Guy Steele's famous "Growing a Language" talk, or if he came up with the same idea independently.

https://www.youtube.com/watch?v=lw6TaiXzHAE


Fantastic link. Does anyone know how he produced his animated slides and how they were controlled during the talk?


You made my day with this link

