Calculating the Median in PostgreSQL: A Comprehensive Guide

4 min readJun 18, 2024

Calculating the median, a measure of central tendency, is a common requirement in data analysis. PostgreSQL, a powerful open-source relational database, provides a straightforward way to compute the median using its built-in window functions. In this article, we’ll explore how to calculate the median in PostgreSQL, delve into the concept of percentiles, and provide a custom version for computing the median.

Understanding Percentiles

Before diving into the implementation, it’s essential to understand the concept of percentiles. A percentile is a statistical measure that indicates the value below which a given percentage of observations in a group falls. For example, the 50th percentile (median) is the value below which 50% of the observations can be found.

In general terms:
- The 25th percentile (Q1) is the value below which 25% of the data lies.
- The 50th percentile (Q2 or median) is the value below which 50% of the data lies.
- The 75th percentile (Q3) is the value below which 75% of the data lies.

Percentiles help in understanding the distribution of data by dividing it into parts.

Calculating the Median in PostgreSQL

PostgreSQL offers the `percentile_cont` function to calculate percentiles, including the median. This function computes a continuous percentile, which is essential for finding the median accurately.

Let’s go through the steps to calculate the median in PostgreSQL.

1. Creating a Sample Table

First, we’ll create a sample table and insert some data.

CREATE TABLE
    sales (region VARCHAR(50), amount NUMERIC);

INSERT INTO
    sales (region, amount)
VALUES
    ('North', 100),
    ('North', 200),
    ('North', 300),
    ('South', 400),
    ('South', 500),
    ('South', 600);

2. Calculating the Median

To calculate the median of the `amount` column in the `sales` table, we use the `percentile_cont` function.

SELECT
    percentile_cont(0.5) WITHIN GROUP (
        ORDER BY
            amount
    ) AS median_amount
FROM
    sales;

In this query:
- `percentile_cont(0.5)` computes the 50th percentile, which is the median.
- `WITHIN GROUP (ORDER BY amount)` specifies the order of the values before computing the percentile.

The result of this query will be the median value of the `amount` column across all rows in the `sales` table.

Calculating Median by Group

Often, you’ll need to calculate the median for different groups within your data. For example, let’s find the median sales amount for each region.

SELECT
    region,
    percentile_cont(0.5) WITHIN GROUP (
        ORDER BY
            amount
    ) AS median_amount
FROM
    sales
GROUP BY
    region;

In this query:
- We group the data by the `region` column.
- For each group, we compute the median of the `amount` column.

The result will give us the median sales amount for each region.

Custom Version for Calculating the Median

While `percentile_cont` is convenient, there may be scenarios where you want a custom method for calculating the median. Here’s how you can create a custom version using common table expressions (CTEs) and window functions:

WITH
    ordered_sales AS (
        SELECT
            amount,
            ROW_NUMBER() OVER (
                ORDER BY
                    amount
            ) AS row_num,
            COUNT(*) OVER () AS total_count
        FROM
            sales
    )
SELECT
    AVG(amount) AS median_amount
FROM
    ordered_sales
WHERE
    row_num IN ((total_count + 1) / 2, (total_count + 2) / 2);

In this custom version:
- We create a CTE called `ordered_sales` that assigns a row number to each `amount` and counts the total number of rows.
- We then select the average of the amounts where the row number is in the middle of the dataset. This approach ensures that we handle both even and odd numbers of rows correctly.

For grouped data, the custom version can be adapted as follows:

WITH
    ordered_sales AS (
        SELECT
            region,
            amount,
            ROW_NUMBER() OVER (
                PARTITION BY
                    region
                ORDER BY
                    amount
            ) AS row_num,
            COUNT(*) OVER (
                PARTITION BY
                    region
            ) AS total_count
        FROM
            sales
    )
SELECT
    region,
    AVG(amount) AS median_amount
FROM
    ordered_sales
WHERE
    row_num IN ((total_count + 1) / 2, (total_count + 2) / 2)
GROUP BY
    region;

In this adaptation:
- We partition the data by `region` in the window functions to calculate row numbers and total counts within each region.
- The final query groups by `region` and calculates the median for each group.

Why Use Percentile Functions?

Using percentile functions like `percentile_cont` offers several advantages:
- Accuracy: `percentile_cont` calculates the percentile value accurately, even when dealing with continuous data.
- Simplicity: The function simplifies the process of percentile calculation, avoiding complex and manual calculations.
- Flexibility: You can compute any percentile (not just the median) by changing the argument. For example, `percentile_cont(0.25)` calculates the 25th percentile.

Conclusion

Calculating the median in PostgreSQL is straightforward with the `percentile_cont` function, and creating a custom version provides additional flexibility. Understanding and utilizing percentiles can provide valuable insights into your data’s distribution. Whether you’re analyzing sales data or any other dataset, PostgreSQL’s powerful window functions make it easy to perform these calculations efficiently.

By leveraging these functions and custom methods, you can enhance your data analysis capabilities and make informed decisions based on accurate statistical measures. Happy querying!

Calculating the Median in PostgreSQL: A Comprehensive Guide

Understanding Percentiles

Calculating the Median in PostgreSQL

Custom Version for Calculating the Median

Why Use Percentile Functions?

Conclusion

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Gregory Kovalchuk

No responses yet

More from Gregory Kovalchuk

A Beginner’s Guide to Q-Learning: Understanding with a Simple Gridworld Example

Reinforcement learning (RL) is one of the most exciting fields in machine learning, allowing agents to learn optimal behaviors in uncertain…

A Deep Dive into SQL LATERAL JOIN

In SQL, joins are fundamental operations that allow us to combine data from multiple tables. Among the various types of joins, one…

Understanding `yield from` in Python Generators: A Comprehensive Guide

Generators in Python are a powerful tool for creating iterators. They allow you to generate values on the fly and save memory, making your…

Using numpy.typing for Type-Safe List Handling in Python

When working with NumPy in Python, ensuring type safety can help catch errors early and improve code readability. The numpy.typing module…

Recommended from Medium

Optimizing Recursive Queries — Enhancing Performance with LATERAL JOIN

Recursive CTE (Common Table Expressions) provides an elegant way to deal with hierarchical data or traverse parent child relationships…

How I Optimized a Long-Running SQL Query from 25 Minutes to Under 30 Seconds?

Slow SQL can cost you more than time — it can hurt your users, dashboards, and cloud bills. In this post, I’ll walk you through a…

Practical SQL Interview Questions-2

10 SQL Concepts Every Analyst Should Master Before Their First Job

Answering Tricky SQL Interview Questions for Data Engineering

SQL interview questions sometimes seem like puzzles designed to test your problem-solving skills and technical depth. Mastering SQL is…

Google Advanced SQL Interview Question Walkthrough

This is the most common advanced SQL which has gained popularity recently in interviews along with Google. Let’s break down this and…