The kernel has to work for a wide range of workloads; it is arguably
unsurprising that it does not always perform as well as some user
communities would like. One community that has sometimes felt left out in the
cold is the PostgreSQL relational database management system project. In
response to an invitation from the organizers of
the 2014 Linux Storage, Filesystem, and Memory Management summit,
PostgreSQL developers Robert Haas, Andres Freund,
and Josh Berkus came to discuss their worst pain points and possible
solutions.
PostgreSQL is an old system, dating back to 1996; it has a lot of users
running on a wide variety of operating systems. So the PostgreSQL
developers are limited in the amount of Linux-specific code they can add.
It is based on cooperating processes; threads are not used. System V
shared memory is used for interprocess communication. Importantly,
PostgreSQL maintains its own internal buffer cache, but also uses buffered
I/O to move data to and from disk. This combination of buffering leads to
a number of problems experienced by PostgreSQL users.
Slow sync
The first problem described is related to how data gets to disk from the
buffer cache. PostgreSQL uses a form of journaling that they call
"write-ahead logging". Changes are first written to the log; once the log
is safely on disk, the main database blocks can be written back. Much of
this work is done in a "checkpoint" process; it writes log entries, then
flushes a bunch of data back to various files on disk. The logging writes
are relatively small and contiguous; they work fairly well, and, according
to Andres, the PostgreSQL developers are happy enough with how that part of
the system works on Linux.
The data writes are another story. The checkpoint process paces those
writes to avoid overwhelming the I/O subsystem. But, when it gets around
to calling fsync() to ensure that the data is safely written, all
of those carefully paced writes are flushed into the request queue at once
and an
I/O storm results. The problem, they said, is not that fsync() is
too slow; instead, it is too fast. It dumps so much data into the I/O
subsystem that everything else, including read requests from applications,
is blocked. That creates pain for users and, thus, for PostgreSQL
developers.
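As a rough illustration of the pattern being described, a checkpoint's interaction with the kernel might be sketched like this (in C, with placeholder names rather than PostgreSQL's real data structures or functions, and with error handling omitted):

    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder type standing in for PostgreSQL's buffer bookkeeping. */
    struct dirty_page { const char *buf; size_t len; off_t off; };

    static void checkpoint(int wal_fd, int data_fd,
                           const struct dirty_page *pages, int npages,
                           unsigned int pacing_delay_us)
    {
        /* Write-ahead rule: pending log entries must be durable before
           the data pages they describe are written back. */
        fdatasync(wal_fd);

        /* The data writes are carefully paced, but they only dirty the
           page cache... */
        for (int i = 0; i < npages; i++) {
            pwrite(data_fd, pages[i].buf, pages[i].len, pages[i].off);
            usleep(pacing_delay_us);
        }

        /* ...so this single call pushes all of that data into the
           request queue at once. */
        fsync(data_fd);
    }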
Ted Ts'o asked whether the ability to limit the checkpoint process to a
specific percentage of the available I/O bandwidth would help. But Robert
responded that I/O priorities would be better; the checkpoint process
should be able to use 100% of the bandwidth if nothing else wants it. Use
of the ionice mechanism (which controls I/O priorities in the CFQ scheduler)
was suggested, but there is a problem:
it does not work for I/O initiated from an fsync() call. Even if
the data was written from the checkpoint process — which is not always the
case — priorities are not applied when the actual I/O is started by
fsync().
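For reference, lowering a process's I/O priority from within the program (rather than with the ionice utility) would look roughly like the sketch below; the helper name is hypothetical, and the constants are spelled out here only because glibc provides no wrapper for this system call. As noted above, though, writeback started by fsync() does not inherit the priority set this way.

    #include <sys/syscall.h>
    #include <unistd.h>

    /* Values mirroring the kernel's ioprio definitions. */
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_CLASS_IDLE  3
    #define IOPRIO_WHO_PROCESS 1
    #define IOPRIO_PRIO_VALUE(class, data) \
            (((class) << IOPRIO_CLASS_SHIFT) | (data))

    /* Put the calling process (the checkpointer, say) into the idle
       I/O class, as "ionice -c idle" would. */
    static int set_idle_io_priority(void)
    {
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    }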
Ric Wheeler suggested that the PostgreSQL developers need to better control
the speed with which they write data; Chris Mason added that the
O_DATASYNC option could be used to give better control over when
the I/O requests are generated. The problem here is that such approaches
require PostgreSQL to know about the speed of the storage device.
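Assuming that the O_DATASYNC mentioned above corresponds to the O_DSYNC flag accepted by open() on Linux, the idea would be to make each write carry its own data-integrity flush, so that durable I/O can be paced one request at a time; a minimal (hypothetical) sketch:

    #include <fcntl.h>

    /* Hypothetical helper: every write on the returned descriptor will
       have data-integrity semantics (as with fdatasync()) before it
       returns, rather than being flushed in one burst at the end. */
    int open_datasync(const char *path)
    {
        return open(path, O_WRONLY | O_DSYNC);
    }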
The discussion turned back to I/O priorities, at which point it was pointed
out that those priorities can only be enforced in the request queue
maintained by the I/O scheduler. Some schedulers, including those favored
by PostgreSQL users (who tend to avoid the CFQ scheduler), do not implement
I/O priorities at all. But, even those that do support I/O priorities
place limits on the length
of the request queue. A big flush of data will quickly fill the queue, at
which point I/O priorities lose most of their effectiveness; a
high-priority request will still languish if there is no room for it in the
request queue. So, it seems,
I/O priorities are not the solution to the problem.
It's not clear what the right solution is. Ted asked if the PostgreSQL
developers could provide a small program that would generate the kind of
I/O patterns created by a running database. Given a way to reproduce the
problem easily, kernel developers could experiment with different
approaches to a solution. This "program" may take the form of a
configuration script for PostgreSQL initially, but a separate (small)
program is what the kernel community would really like to see.
Double buffering
PostgreSQL needs to do its own buffering; it also needs to use buffered I/O
for a number of reasons. That leads to a problem, though: database data
tends to be stored in memory twice, once in the PostgreSQL buffer, and once
in the page cache. That increases the amount of memory used by PostgreSQL
considerably, to the detriment of the system as a whole.
Much of that memory waste could conceivably be eliminated. Consider, for
example, a dirty buffer in the PostgreSQL cache. It is more current than
any version of that data that the kernel might have in the page cache; the
only thing that will ever happen to the page cache copy is that it will be
overwritten when PostgreSQL flushes the dirty buffer. So there is no value
to keeping that data in the page cache. In this case, it
would be nice if PostgreSQL could tell the kernel to remove the pages of
interest from the page cache, but there is currently no good API for that.
Calling fadvise() with FADV_DONTNEED can, according to
Andres, actually cause
pages to be read in; nobody quite understood this behavior, but all agreed
it shouldn't work that way. They can't use madvise() without
mapping the files; doing that in possibly hundreds of processes tends to be
very slow.
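The call in question is presumably the posix_fadvise() interface; a minimal sketch of how PostgreSQL might try to drop such a range from the page cache (the helper name, offset, and length are placeholders standing in for the database's own bookkeeping) would be:

    #include <fcntl.h>

    /* Ask the kernel to drop cached pages for a range the database
       already holds (and may have dirtied) in its own buffer cache. */
    static int drop_from_page_cache(int fd, off_t offset, off_t length)
    {
        return posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
    }

As described above, though, this is exactly the call that can surprisingly trigger reads, so it is a sketch of intent rather than a working recipe.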
It would also be nice to be able to move pages in the opposite direction:
PostgreSQL might want to remove a clean page from its own cache, but leave
a copy in the page cache. That could possibly be done with a special write
operation that would
not actually cause I/O, or with a system call that would transfer a physical
page into the page cache. There was some talk of the form such an
interface might take, but this part of the discussion eventually wound down
without any firm conclusions.
Regressions
Another problem frequently experienced by PostgreSQL users is that recent
kernel features tend to create performance problems. For example, the
transparent huge pages feature tends not to bring much benefit to
PostgreSQL workloads, but it slows them down significantly. Evidently a
lot of time goes into the compaction code, which is working hard without
actually producing a lot of free huge pages. In many systems, terrible
performance problems simply vanish when transparent huge pages are turned
off.
Mel Gorman answered that, if compaction is hurting performance, it's a bug. That
said, he hasn't seen any transparent huge page bugs for quite some time.
There is, he said, a patch out there which puts a limit on the number of
processes that can be performing compaction at any given time. It has not
been merged, though, because nobody has ever seen a workload where too many
processes running compaction was a problem. It might, he suggested, be
time to revisit that particular patch.
Another source of pain is the "zone reclaim" feature, whereby the kernel
reclaims pages from some zones even if the system as a whole is not short
of memory. Zone reclaim can slow down PostgreSQL workloads; usually the
best thing to do on a PostgreSQL server is to simply disable the feature
altogether. Andres noted that he has been called in as a consultant many
times to deal with performance problems related to zone reclaim; it has, he
said, been a good money-maker for him. Still, it would be good if the
problem were to be fixed.
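For the record, both of the features discussed in this section can be disabled at run time through their usual /proc and /sys knobs; a minimal sketch (the helper is hypothetical, and root privileges are required) might look like:

    #include <stdio.h>

    /* Write a single value to a /proc or /sys control file. */
    static int write_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs(value, f);
        return fclose(f);
    }

    int main(void)
    {
        /* Equivalent of "sysctl vm.zone_reclaim_mode=0". */
        write_knob("/proc/sys/vm/zone_reclaim_mode", "0\n");
        /* Turn transparent huge pages off entirely. */
        write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "never\n");
        return 0;
    }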
Mel noted that the zone reclaim mode was written under the assumption that
all processes in the system would fit into a single NUMA node. That
assumption no longer makes sense; it's long past time, he said, that this
option's default changed to "off." There seemed to be no opposition to
that idea in the room, so a change may happen sometime in the relatively
near future.
Finally, the PostgreSQL developers noted that, in general, kernel upgrades
tend to be scary. The performance characteristics of Linux kernels tend to
vary widely from one release to the next; that makes upgrades into an
uncertain affair. There was some talk of finding ways to run PostgreSQL
benchmarks on new kernels, but no definite conclusions were reached. As a
whole, though, developers for both projects were happy with how the
conversation came out; if nothing else, it represents a new level of
communication between the two projects.
[Your editor would like to thank the Linux Foundation for supporting his
travel to the LSFMM summit.]