Image by Author

Pandas v Polars

Thomas Reid
Level Up Coding
Published in
6 min readAug 1, 2023

A database speed test. Who wins?

We have a process that runs against our AWS RDS Postgres database. It’s based on a Python Jupyter Notebook and reads a large table into a Pandas data frame before further manipulating the contents of the data frame.

I was interested to compare the run-time of reading the table into a data frame using Pandas versus using Polars. Note that the version of Pandas tested was 1.5.3.

To give you an idea of the data volume, the table I was targeting contained 6.1 million records and 156 columns. The object size on disk as reported by the dbeaver database client was 2.9 GB.

Before trying any of the below code, ensure you have all required external libraries installed on your system. These should include the following:

pip install adbc-driver-postgres, pyarrow
pip install sqlalchemy, psycopg2, pandas

Pandas

I don’t think I need to explain much about what Pandas is. Its use in Python code is ubiquitous and is one of the main tools that people use to load, explore, visualise and process large amounts of data in Python.

One of the ways to read a Postgres database table using Pandas uses an external database library like sqlalchemy and so your code will probably look something like this:

from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd


query = f"""
SELECT *
FROM your_big_table
"""



# Create the SQLAlchemy engine with your postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')



with engine.connect() as conn:
df=pd.read_sql_query(text(query), con = conn)

I have 32GB of RAM available on my laptop and the above code was able to complete in about 26 minutes. If your system is not as highly specced, it might be that you receive some kind of memory error.

Luckily we can potentially fix that by using a chunking strategy, so here’s some other code you can try that implements chunking.


from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

query = f"""
SELECT *
FROM your_big_table…

Create an account to read the full story.

The author made this story available to Medium members only.
If you’re new to Medium, create a new account to read this story on us.

Or, continue in mobile web

Already have an account? Sign in

A data engineer , specialising in the AWS cloud with particular interest in serverless and the energy, finance and insurance sectors