The kernel has to work for a wide range of workloads; it is arguably
unsurprising that it does not always perform as well as some user
communities would like. One community that has sometimes felt left out in the
cold is the PostgreSQL relational database management system project. In
response to an invitation from the organizers of
the 2014 Linux Storage, Filesystem, and Memory Management summit,
PostgreSQL developers Robert Haas, Andres Freund,
and Josh Berkus came to discuss their worst pain points and possible
solutions.
PostgreSQL is an old system, dating back to 1996; it has a lot of users
running on a wide variety of operating systems. So the PostgreSQL
developers are limited in the amount of Linux-specific code they can add.
It is based on cooperating processes; threads are not used. System V
shared memory is used for interprocess communication. Importantly,
PostgreSQL maintains its own internal buffer cache, but also uses buffered
I/O to move data to and from disk. This combination of buffering leads to
a number of problems experienced by PostgreSQL users.
Slow sync
The first problem described is related to how data gets to disk from the
buffer cache. PostgreSQL uses a form of journaling that they call
"write-ahead logging". Changes are first written to the log; once the log
is safely on disk, the main database blocks can be written back. Much of
this work is done in a "checkpoint" process; it writes log entries, then
flushes a bunch of data back to various files on disk. The logging writes
are relatively small and contiguous; they work fairly well, and, according
to Andres, the PostgreSQL developers are happy enough with how that part of
the system works on Linux.
The data writes are another story. The checkpoint process paces those
writes to avoid overwhelming the I/O subsystem. But, when it gets around
to calling fsync() to ensure that the data is safely written, all
of those carefully paced writes are flushed into the request queue at once
and an
I/O storm results. The problem, they said, is not that fsync() is
too slow; instead, it is too fast. It dumps so much data into the I/O
subsystem that everything else, including read requests from applications,
is blocked. That creates pain for users and, thus, for PostgreSQL
developers.
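As a rough illustration of the pattern being described, a checkpoint's interaction with the kernel might be sketched like this (in C, with placeholder names rather than PostgreSQL's real data structures or functions, and with error handling omitted):

    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder type standing in for PostgreSQL's buffer bookkeeping. */
    struct dirty_page { const char *buf; size_t len; off_t off; };

    static void checkpoint(int wal_fd, int data_fd,
                           const struct dirty_page *pages, int npages,
                           unsigned int pacing_delay_us)
    {
        /* Write-ahead rule: pending log entries must be durable before
           the data pages they describe are written back. */
        fdatasync(wal_fd);

        /* The data writes are carefully paced, but they only dirty the
           page cache... */
        for (int i = 0; i < npages; i++) {
            pwrite(data_fd, pages[i].buf, pages[i].len, pages[i].off);
            usleep(pacing_delay_us);
        }

        /* ...so this single call pushes all of that data into the
           request queue at once. */
        fsync(data_fd);
    }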
Ted Ts'o asked whether the ability to limit the checkpoint process to a
specific percentage of the available I/O bandwidth would help. But Robert
responded that I/O priorities would be better; the checkpoint process
should be able to use 100% of the bandwidth if nothing else wants it. Use
of the ionice mechanism (which controls I/O priorities in the CFQ scheduler)
was suggested, but there is a problem:
it does not work for I/O initiated from an fsync() call. Even if
the data was written from the checkpoint process — which is not always the
case — priorities are not applied when the actual I/O is started by
fsync().
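For reference, lowering a process's I/O priority from within the program (rather than with the ionice utility) would look roughly like the sketch below; the helper name is hypothetical, and the constants are spelled out here only because glibc provides no wrapper for this system call. As noted above, though, writeback started by fsync() does not inherit the priority set this way.

    #include <sys/syscall.h>
    #include <unistd.h>

    /* Values mirroring the kernel's ioprio definitions. */
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_CLASS_IDLE  3
    #define IOPRIO_WHO_PROCESS 1
    #define IOPRIO_PRIO_VALUE(class, data) \
            (((class) << IOPRIO_CLASS_SHIFT) | (data))

    /* Put the calling process (the checkpointer, say) into the idle
       I/O class, as "ionice -c idle" would. */
    static int set_idle_io_priority(void)
    {
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    }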
Ric Wheeler suggested that the PostgreSQL developers need to better control
the speed with which they write data; Chris Mason added that the
O_DATASYNC option could be used to give better control over when
the I/O requests are generated. The problem here is that such approaches
require PostgreSQL to know about the speed of the storage device.
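Assuming that the O_DATASYNC mentioned above corresponds to the O_DSYNC flag accepted by open() on Linux, the idea would be to make each write carry its own data-integrity flush, so that durable I/O can be paced one request at a time; a minimal (hypothetical) sketch:

    #include <fcntl.h>

    /* Hypothetical helper: every write on the returned descriptor will
       have data-integrity semantics (as with fdatasync()) before it
       returns, rather than being flushed in one burst at the end. */
    int open_datasync(const char *path)
    {
        return open(path, O_WRONLY | O_DSYNC);
    }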
The discussion turned back to I/O priorities, at which point it was pointed
out that those priorities can only be enforced in the request queue
maintained by the I/O scheduler. Some schedulers, including those favored
by PostgreSQL users (who tend to avoid the CFQ scheduler), do not implement
I/O priorities at all. But, even those that do support I/O priorities
place limits on the length
of the request queue. A big flush of data will quickly fill the queue, at
which point I/O priorities lose most of their effectiveness; a
high-priority request will still languish if there is no room for it in the
request queue. So, it seems,
I/O priorities are not the solution to the problem.
It's not clear what the right solution is. Ted asked if the PostgreSQL
developers could provide a small program that would generate the kind of
I/O patterns created by a running database. Given a way to reproduce the
problem easily, kernel developers could experiment with different
approaches to a solution. This "program" may take the form of a
configuration script for PostgreSQL initially, but a separate (small)
program is what the kernel community would really like to see.
Double buffering
PostgreSQL needs to do its own buffering; it also needs to use buffered I/O
for a number of reasons. That leads to a problem, though: database data
tends to be stored in memory twice, once in the PostgreSQL buffer, and once
in the page cache. That increases the amount of memory used by PostgreSQL
considerably, to the detriment of the system as a whole.
Much of that memory waste could conceivably be eliminated. Consider, for
example, a dirty buffer in the PostgreSQL cache. It is more current than
any version of that data that the kernel might have in the page cache; the
only thing that will ever happen to the page cache copy is that it will be
overwritten when PostgreSQL flushes the dirty buffer. So there is no value
to keeping that data in the page cache. In this case, it
would be nice if PostgreSQL could tell the kernel to remove the pages of
interest from the page cache, but there is currently no good API for that.
Calling fadvise() with FADV_DONTNEED can, according to
Andres, actually cause
pages to be read in; nobody quite understood this behavior, but all agreed
it shouldn't work that way. They can't use madvise() without
mapping the files; doing that in possibly hundreds of processes tends to be
very slow.
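The call in question is presumably the posix_fadvise() interface; a minimal sketch of how PostgreSQL might try to drop such a range from the page cache (the helper name, offset, and length are placeholders standing in for the database's own bookkeeping) would be:

    #include <fcntl.h>

    /* Ask the kernel to drop cached pages for a range the database
       already holds (and may have dirtied) in its own buffer cache. */
    static int drop_from_page_cache(int fd, off_t offset, off_t length)
    {
        return posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
    }

As described above, though, this is exactly the call that can surprisingly trigger reads, so it is a sketch of intent rather than a working recipe.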
It would also be nice to be able to move pages in the opposite direction:
PostgreSQL might want to remove a clean page from its own cache, but leave
a copy in the page cache. That could possibly be done with a special write
operation that would
not actually cause I/O, or with a system call that would transfer a physical
page into the page cache. There was some talk of the form such an
interface might take, but this part of the discussion eventually wound down
without any firm conclusions.
Regressions
Another problem frequently experienced by PostgreSQL users is that recent
kernel features tend to create performance problems. For example, the
transparent huge pages feature tends not to bring much benefit to
PostgreSQL workloads, but it slows them down significantly. Evidently a
lot of time goes into the compaction code, which is working hard without
actually producing a lot of free huge pages. In many systems, terrible
performance problems simply vanish when transparent huge pages are turned
off.
Mel Gorman answered that, if compaction is hurting performance, it's a bug. That
said, he hasn't seen any transparent huge page bugs for quite some time.
There is, he said, a patch out there which puts a limit on the number of
processes that can be performing compaction at any given time. It has not
been merged, though, because nobody has ever seen a workload where too many
processes running compaction was a problem. It might, he suggested, be
time to revisit that particular patch.
Another source of pain is the "zone reclaim" feature, whereby the kernel
reclaims pages from some zones even if the system as a whole is not short
of memory. Zone reclaim can slow down PostgreSQL workloads; usually the
best thing to do on a PostgreSQL server is to simply disable the feature
altogether. Andres noted that he has been called in as a consultant many
times to deal with performance problems related to zone reclaim; it has, he
said, been a good money-maker for him. Still, it would be good if the
problem were to be fixed.
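For the record, both of the features discussed in this section can be disabled at run time through their usual /proc and /sys knobs; a minimal sketch (the helper is hypothetical, and root privileges are required) might look like:

    #include <stdio.h>

    /* Write a single value to a /proc or /sys control file. */
    static int write_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs(value, f);
        return fclose(f);
    }

    int main(void)
    {
        /* Equivalent of "sysctl vm.zone_reclaim_mode=0". */
        write_knob("/proc/sys/vm/zone_reclaim_mode", "0\n");
        /* Turn transparent huge pages off entirely. */
        write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "never\n");
        return 0;
    }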
Mel noted that the zone reclaim mode was written under the assumption that
all processes in the system would fit into a single NUMA node. That
assumption no longer makes sense; it's long past time, he said, that this
option's default changed to "off." There seemed to be no opposition to
that idea in the room, so a change may happen sometime in the relatively
near future.
Finally, the PostgreSQL developers noted that, in general, kernel upgrades
tend to be scary. The performance characteristics of Linux kernels tend to
vary widely from one release to the next; that makes upgrades into an
uncertain affair. There was some talk of finding ways to run PostgreSQL
benchmarks on new kernels, but no definite conclusions were reached. As a
whole, though, developers for both projects were happy with how the
conversation came out; if nothing else, it represents a new level of
communication between the two projects.
[Your editor would like to thank the Linux Foundation for supporting his
travel to the LSFMM summit.]