
Super fast aggregations in PostgreSQL 19

11.2025

PostgreSQL 18 has just been born and we are already talking about the blessings of PostgreSQL 19 and beyond? Well, yes, and there is a good reason for it.

Recently, an important series of changes has been committed, which some of our folks (and many others around the world) have been working on for many, many years, and which has the power to seriously speed up aggregations for basically everyone out there. What is even better: you don't have to change your code; there is no need to adjust parameters or anything of that sort. You can simply run your code as it is and enjoy the benefits of this important improvement.

Aggregating data in PostgreSQL

To understand why this improvement in the query optimizer is so important, we first have to understand how PostgreSQL has handled grouping up to now. The simplest rule was: "Join first, aggregate later". What does that mean? Consider the following example:
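A minimal sketch of such a setup might look as follows. The table and column names (person, gender, gender_id, gender_name) are taken from the description below; everything else is an assumption:

CREATE TABLE gender (
    gender_id    int PRIMARY KEY,
    gender_name  text NOT NULL
);

CREATE TABLE person (
    person_id    bigint PRIMARY KEY,
    gender_id    int REFERENCES gender (gender_id)
    -- a real application would have many more columns here
);

-- count people per gender name
SELECT g.gender_name, count(*)
FROM   person AS p
       JOIN gender AS g USING (gender_id)
GROUP BY g.gender_name;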

Let us assume that we only store a handful of genders but millions of people. The way everything before PostgreSQL 19 handles this type of query is the following. Keep in mind that the example is of course highly simplified to make understanding easy:

  • Read every entry in the person table
  • For every entry, look up the gender_name and add it to the count of the matching group
  • Display the result

What is fundamentally wrong with this approach? Actually nothing. This is how most systems would handle this type of operation. However, there is an inefficiency here: in a large table, millions of people might be female, yet the system looks up the name for each gender_id again and again. If almost every id is different, this is fine. However, if there are only a handful of different ids, the operation takes way too long and becomes pretty repetitive.

The breakthrough: Aggregate first, join later

The big breakthrough in PostgreSQL 19 (which is, at the time of writing, under development) is that the optimizer can now decide: aggregate first and join later, or join first and aggregate later? What sounds like a tiny improvement has MASSIVE implications.

In the next listing, you can see some typical code which you might find in countless applications:
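A sketch of the kind of data structure meant here. The names category, color and product are assumptions derived from the description that follows:

CREATE TABLE category (
    category_id    int PRIMARY KEY,
    category_name  text NOT NULL
);

CREATE TABLE color (
    color_id    int PRIMARY KEY,
    color_name  text NOT NULL
);

CREATE TABLE product (
    product_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    category_id  int REFERENCES category (category_id),
    color_id     int REFERENCES color (color_id),
    price        numeric
);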

The data structure features two lookup tables (for categories and colors), which hardly contain any data, and a massive product table consisting of potentially millions of rows. In our example it is populated with just 200,000 rows:
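The exact test data does not matter; one possible way to fill the lookup tables and generate 200,000 random products could be:

INSERT INTO category (category_id, category_name)
    SELECT i, 'category_' || i
    FROM   generate_series(1, 10) AS i;

INSERT INTO color (color_id, color_name)
    SELECT i, 'color_' || i
    FROM   generate_series(1, 10) AS i;

INSERT INTO product (category_id, color_id, price)
    SELECT (random() * 9)::int + 1,   -- random category between 1 and 10
           (random() * 9)::int + 1,   -- random color between 1 and 10
           random() * 100
    FROM   generate_series(1, 200000);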

The goal is simple: Count the number of products per category and per color. The query to do this is far from fancy and might look as follows:
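Sticking with the assumed schema from above, a sketch of such a query:

SELECT c.category_name,
       col.color_name,
       count(*)
FROM   product  AS p
       JOIN category AS c   ON p.category_id = c.category_id
       JOIN color    AS col ON p.color_id    = col.color_id
GROUP BY c.category_name, col.color_name;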

As you can observe, we are talking about a join spanning just three tables. For every product, all we do is look up two names. In versions older than PostgreSQL 19 (assuming the patch won't be reverted during development), our execution plan would have looked as follows:
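The exact numbers depend on the system; with the schema sketched above, the shape of such a plan is roughly the following (costs, row counts and timings omitted, details simplified):

HashAggregate
  Group Key: c.category_name, col.color_name
  ->  Hash Join
        Hash Cond: (p.color_id = col.color_id)
        ->  Hash Join
              Hash Cond: (p.category_id = c.category_id)
              ->  Seq Scan on product p
              ->  Hash
                    ->  Seq Scan on category c
        ->  Hash
              ->  Seq Scan on color col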

What we have to do here is to read the plan "from the inside to the outside" to understand what is going on. The plan starts with sequential scans on color and category and finally ends up joining all those lookup tables with the product table. Only then does aggregation (= counting) start. In other words, for every product we had to run two lookups.

Let us compare this to the new method:
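Again reduced to the bare shape (node names may still change during development; the index names assume the primary keys created above), an eagerly aggregated plan could look like this:

Finalize HashAggregate
  Group Key: c.category_name, col.color_name
  ->  Nested Loop
        ->  Nested Loop
              ->  Partial HashAggregate
                    Group Key: p.category_id, p.color_id
                    ->  Seq Scan on product p
              ->  Index Scan using category_pkey on category c
        ->  Index Scan using color_pkey on color col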

What we see here is that we immediately approach the product table and aggregate it by those ids, and only THEN apply nested loops to join the data. From there, the rest is easy, because after the HashAggregate almost no data is left (only a handful of rows). The trick is that, after aggregating by those ids, we only have to look up a couple of names, which saves countless iterations.

Inspecting database performance

Obviously, the second method is more efficient. The following graph shows the difference one can hope for:

Wow, this is more than 5 times faster. The beauty is: The more lookup tables there are, the bigger the gain we can expect. The improvement is massive.

Side note: first run, no hint bits, fresh statistics, MacBook M3, default PostgreSQL configuration.

CUBE: Limitations

While the results are stunning for the vast majority of applications, there are of course some corner cases and less frequently used features which cannot fully benefit from this functionality. One of them is GROUP BY CUBE:
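Staying with the assumed schema from above, such a query might look like this:

SELECT c.category_name,
       col.color_name,
       count(*)
FROM   product  AS p
       JOIN category AS c   ON p.category_id = c.category_id
       JOIN color    AS col ON p.color_id    = col.color_id
GROUP BY CUBE (c.category_name, col.color_name);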

The execution plan looks as follows:
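In a simplified sketch (again without costs and details, and based on the assumed schema), the shape is along these lines: the joins come first and a single aggregate node computing all grouping sets sits on top:

MixedAggregate
  Hash Key: c.category_name, col.color_name
  Hash Key: c.category_name
  Hash Key: col.color_name
  Group Key: ()
  ->  Hash Join
        Hash Cond: (p.color_id = col.color_id)
        ->  Hash Join
              Hash Cond: (p.category_id = c.category_id)
              ->  Seq Scan on product p
              ->  Hash
                    ->  Seq Scan on category c
        ->  Hash
              ->  Seq Scan on color col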

In this case, we can see that all those aggregations (the idea of CUBE is to perform many of them at a time) are still performed on top. For various reasons, this is necessary here. However, as I stated before: this functionality is not as common as a simple, straightforward GROUP BY.

Finally ...

If you want to learn more about cubes and grouping sets in PostgreSQL, we recommend checking out some relevant blog posts about this essential topic.


2 comments on “Super fast aggregations in PostgreSQL 19”

  1. What happens if I categorize by a dimension table which contains both the fine-grained id + name and a broader category? E.g. the dimension is on department, but it also contains divisions, where multiple departments share a division. Does the GROUP BY then do two groupings: first by the original department id and then, after joining, by the division name?


© 2025 CYBERTEC PostgreSQL International GmbH