A database speed test. Who wins?
We have a process that runs against our AWS RDS Postgres database. It’s based on a Python Jupyter Notebook and reads a large table into a Pandas data frame before manipulating the data frame’s contents further.
I was interested in comparing the run-time of reading the table into a data frame using Pandas versus Polars. Note that the version of Pandas tested was 1.5.3.
To give you an idea of the data volume, the table I was targeting contained 6.1 million records and 156 columns. The object size on disk as reported by the DBeaver database client was 2.9 GB.
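If you want to check the on-disk size of your own table, a minimal sketch like the one below (assuming psycopg2 and the same placeholder connection details and table name used throughout this article) queries Postgres’s built-in pg_total_relation_size function:

import psycopg2

# Connect using your own credentials (placeholders shown here)
conn = psycopg2.connect('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')
with conn.cursor() as cur:
    # pg_total_relation_size counts the table plus its indexes and TOAST data
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('your_big_table'))")
    print(cur.fetchone()[0])
conn.close()

The figure reported this way may differ slightly from what a client like DBeaver shows, depending on exactly what the client includes in its calculation.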
Before trying any of the code below, ensure you have all the required external libraries installed on your system. These should include the following:
pip install adbc-driver-postgresql pyarrow
pip install sqlalchemy psycopg2 pandas
Pandas
I don’t think I need to explain much about what Pandas is. Its use in Python code is ubiquitous, and it is one of the main tools people use to load, explore, visualise and process large amounts of data in Python.
One way to read a Postgres database table with Pandas relies on an external database library like SQLAlchemy, so your code will probably look something like this:
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

# Select every row and column from the target table
query = """
SELECT *
FROM your_big_table
"""

# Create the SQLAlchemy engine with your Postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Run the query and load the full result set into a data frame
with engine.connect() as conn:
    df = pd.read_sql_query(text(query), con=conn)
I have 32 GB of RAM available on my laptop, and the above code completed in about 26 minutes. If your system is less highly specced, you may run into some kind of memory error.
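If you want to reproduce this kind of timing yourself, a simple wall-clock measurement is enough. Here’s a minimal sketch using Python’s time module (the connection string and table name are the same placeholders as above):

import time

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.sql import text

engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Time only the read itself, not the engine creation
start = time.perf_counter()
with engine.connect() as conn:
    df = pd.read_sql_query(text("SELECT * FROM your_big_table"), con=conn)
elapsed = time.perf_counter() - start

print(f"Read {len(df):,} rows in {elapsed / 60:.1f} minutes")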
Luckily, we can potentially work around that with a chunking strategy, so here’s some alternative code you can try that implements chunking.
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd
query = """
SELECT *
FROM your_big_table…