Fast commits for ext4
Posted Jan 18, 2021 19:24 UTC (Mon) by tytso (subscriber, #9993)
In reply to: Fast commits for ext4 by NYKevin
Parent article: Fast commits for ext4
1) Write the new contents of foo.c to foo.c.new
2) Fsync foo.c.new --- check the error return from the fsync(2) as well as the close(2)
3) Delete foo.c.bak
4) Create a hard link from foo.c to foo.c.bak
5) Rename foo.c.new on top of foo.c
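The five steps above can be sketched with Python's os-level wrappers around the same syscalls (open, fsync, unlink, link, rename); the helper name `atomic_replace` and the `.new`/`.bak` suffixes simply follow the example filenames in this comment.

```python
import os

def atomic_replace(path, data):
    """Replace `path` with `data`, keeping the old contents in a .bak link.

    A sketch of the five-step ritual described above; error returns from
    fsync(2) and close(2) surface here as OSError exceptions.
    """
    new = path + ".new"
    bak = path + ".bak"
    # 1) Write the new contents to foo.c.new
    fd = os.open(new, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # 2) fsync foo.c.new -- an error here (or on close) means the
        #    data may not be durable, e.g. an AFS-style quota overrun
        os.fsync(fd)
    finally:
        os.close(fd)  # raises OSError if close(2) reports an error
    # 3) Delete any stale foo.c.bak
    try:
        os.unlink(bak)
    except FileNotFoundError:
        pass
    # 4) Hard-link foo.c to foo.c.bak so the old contents survive
    os.link(path, bak)
    # 5) Atomically rename foo.c.new on top of foo.c
    os.rename(new, path)
```

After a crash at any point, `path` resolves to either the complete old contents or the complete new contents, never a mixture.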
This doesn't require an fsync of the directory, but it guarantees that /path/to/foo.c will have either the original contents of foo.c or the new contents of foo.c, even if there is a crash at any time during the above process. If you want portability to other Posix operating systems, including people running, say, retro versions of BSD 4.3, this is what you should do. It's what emacs and vi do, and some of the "ritual", such as making sure you check the error return from close(2), is there because otherwise you might lose data if you run into a quota overrun on the Andrew File System (the distributed file system developed at CMU, and used at MIT Project Athena as well as several National Labs and financial institutions).
That being said, rename(2) is not a write barrier. However, as part of the compromise that came out of the O_PONIES discussion, on a close(2) of a file that was opened with O_TRUNC, or on a rename(2) where the destination file is being overwritten, an immediate write-out is initiated for the file being closed or for the source file of the rename. This doesn't block the rename(2) operation from returning, but it narrows the race window from 30 seconds to however long it takes to complete the writeout, which is typically less than a second. It's also something that was implemented informally by all of the major file systems at the time of the O_PONIES controversy, but it doesn't necessarily account for what newer file systems (bcachefs and f2fs, for example) might decide to do, and of course it says nothing about what other operating systems such as MacOS might be doing.
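For contrast, here is a sketch of the shortcut that this implicit-writeout compromise papers over: writing a temp file and renaming it into place with no fsync at all. The helper name `careless_replace` is mine, not anything from the discussion; the point is only that the overwriting rename initiates (but does not wait for) writeout of the temp file's data.

```python
import os

def careless_replace(path, data):
    """Write a temp file and rename it over `path` WITHOUT any fsync.

    On file systems implementing the O_PONIES-era compromise, the
    overwriting rename(2) initiates an immediate writeout of the temp
    file's data, shrinking -- but not eliminating -- the window in
    which a crash could lose the new contents.
    """
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(data)      # no os.fsync(f.fileno()) before the rename
    os.rename(tmp, path)   # writeout initiated, but not awaited
```

Absent that compromise (or on a file system that chooses not to implement it), the unsynced data could sit in the page cache for the full writeback interval before reaching disk.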
The compromise was designed to minimize performance impact, since users and applications also depend upon performance --- and get cranky when there are regressions --- while still papering over most of the problems caused by careless applications. From the file system developers' perspective, the ultimate responsibility rests with application writers if they think a particular file write is precious and must not be lost after a system or application crash. After all, an application may do something really stupid, such as overwriting a precious file by opening it with O_TRUNC because it's too much of a pain to copy over ACLs and extended attributes, so it's simpler to just truncate the data file, overwrite it, and cross your fingers. There is absolutely no way the file system can protect against application-writer stupidity, but we can try to minimize the risk of damage while not penalizing the performance of applications which are doing the right thing and are writing, say, a scratch file.