A database speed test. Who wins?
We have a process that runs against our AWS RDS Postgres database. It’s based on a Python Jupyter Notebook and reads a large table into a Pandas data frame before manipulating the data frame’s contents further.
I was interested in comparing the run-time of reading the table into a data frame using Pandas versus Polars. Note that the version of Pandas tested was 1.5.3.
To give you an idea of the data volume, the table I was targeting contained 6.1 million records and 156 columns. The object size on disk as reported by the DBeaver database client was 2.9 GB.
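If you want to check the on-disk size of your own table, a minimal sketch like the one below (assuming psycopg2 and the same placeholder connection details and table name used throughout this article) queries Postgres’s built-in pg_total_relation_size function:

import psycopg2

# Connect using your own credentials (placeholders shown here)
conn = psycopg2.connect('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')
with conn.cursor() as cur:
    # pg_total_relation_size counts the table plus its indexes and TOAST data
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('your_big_table'))")
    print(cur.fetchone()[0])
conn.close()

The figure reported this way may differ slightly from what a client like DBeaver shows, depending on exactly what the client includes in its calculation.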
Before trying any of the code below, ensure you have all the required external libraries installed on your system. These should include the following:
pip install adbc-driver-postgresql pyarrow
pip install sqlalchemy psycopg2 pandas
Pandas
I don’t think I need to explain much about what Pandas is. Its use in Python code is ubiquitous, and it is one of the main tools people use to load, explore, visualise and process large amounts of data in Python.
One way to read a Postgres database table with Pandas relies on an external database library like SQLAlchemy, so your code will probably look something like this:
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

# Select every row and column from the target table
query = """
SELECT *
FROM your_big_table
"""

# Create the SQLAlchemy engine with your Postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Run the query and load the full result set into a data frame
with engine.connect() as conn:
    df = pd.read_sql_query(text(query), con=conn)
I have 32 GB of RAM available on my laptop, and the above code completed in about 26 minutes. If your system is less highly specced, you may run into some kind of memory error.
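If you want to reproduce this kind of timing yourself, a simple wall-clock measurement is enough. Here’s a minimal sketch using Python’s time module (the connection string and table name are the same placeholders as above):

import time

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.sql import text

engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Time only the read itself, not the engine creation
start = time.perf_counter()
with engine.connect() as conn:
    df = pd.read_sql_query(text("SELECT * FROM your_big_table"), con=conn)
elapsed = time.perf_counter() - start

print(f"Read {len(df):,} rows in {elapsed / 60:.1f} minutes")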
Luckily, we can potentially work around that with a chunking strategy, so here’s some alternative code you can try that implements chunking.
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd
query = """
SELECT *
FROM your_big_table…