A PySpark Example for Dealing with Larger than Memory Datasets
A step-by-step tutorial on how to use Spark to perform exploratory data analysis on larger-than-memory datasets.
Analyzing datasets that are larger than the available RAM using Jupyter notebooks and Pandas DataFrames is a challenging issue. The problem has already been addressed elsewhere (for instance here or here), but my objective here is different. I will present a method for performing exploratory analysis on a large dataset with the purpose of identifying and filtering out unnecessary data. The hope is that, in the end, the filtered dataset can be handled by Pandas for the rest of the computations.
The idea for this article came from one of my latest projects, involving the analysis of the Open Food Facts database. It contains nutritional information about products sold all around the world, and at the time of writing the CSV export they provide is 4.2 GB. This was larger than the 3 GB of RAM I had on my Ubuntu VM. However, by using PySpark I was able to run some analysis and select only the information that was of interest for my project.
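To make the approach concrete, here is a minimal sketch of the workflow, assuming the Open Food Facts CSV export has been downloaded locally. The file name, the separator, and the column names used below are illustrative assumptions, not taken from the article itself:

```python
from pyspark.sql import SparkSession

# Start a local Spark session. The driver memory setting is a hedge to stay
# within the VM's 3 GB of RAM; adjust it to your own machine.
spark = (
    SparkSession.builder
    .appName("open-food-facts-eda")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)

# Spark reads the file lazily, so the 4.2 GB export is never fully loaded
# into memory. The file name and separator below are assumptions.
df = spark.read.csv(
    "en.openfoodfacts.org.products.csv",
    sep="\t",
    header=True,
)

# Keep only the columns and rows of interest (hypothetical column names),
# then hand the much smaller result over to Pandas for the rest of the work.
subset = (
    df.select("product_name", "countries_en", "energy_100g")
      .filter(df.countries_en == "France")
)
pandas_df = subset.toPandas()
```

The key point is that all the heavy lifting (reading, selecting, filtering) happens in Spark, and only the reduced dataset ever crosses into Pandas via toPandas().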
I took the following steps to set up my environment on Ubuntu:
- Install Anaconda
- Install Java OpenJDK 11: sudo apt-get install openjdk-11-jdk. The Java version is important, as Spark only works with Java 8 or 11
- Install Apache Spark (version 3.1.2…