A PySpark Example for Dealing with Larger than Memory Datasets
A step-by-step tutorial on how to use Spark to perform exploratory data analysis on larger-than-memory datasets.
Analyzing datasets that are larger than the available RAM using Jupyter notebooks and Pandas DataFrames is a challenging issue. The problem has already been addressed elsewhere (for instance here or here), but my objective here is different. I will present a method for performing exploratory analysis on a large dataset with the purpose of identifying and filtering out unnecessary data. The hope is that, in the end, the filtered dataset can be handled by Pandas for the rest of the computations.
The idea for this article came from one of my latest projects, involving the analysis of the Open Food Facts database. It contains nutritional information about products sold all around the world, and at the time of writing the CSV export they provide is 4.2 GB. This was larger than the 3 GB of RAM I had on my Ubuntu VM. However, by using PySpark I was able to run some analysis and select only the information that was of interest for my project.
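To make the approach concrete, here is a minimal sketch of the workflow, assuming the Open Food Facts CSV export has been downloaded locally. The file name, the separator, and the column names used below are illustrative assumptions, not taken from the article itself:

```python
from pyspark.sql import SparkSession

# Start a local Spark session. The driver memory setting is a hedge to stay
# within the VM's 3 GB of RAM; adjust it to your own machine.
spark = (
    SparkSession.builder
    .appName("open-food-facts-eda")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)

# Spark reads the file lazily, so the 4.2 GB export is never fully loaded
# into memory. The file name and separator below are assumptions.
df = spark.read.csv(
    "en.openfoodfacts.org.products.csv",
    sep="\t",
    header=True,
)

# Keep only the columns and rows of interest (hypothetical column names),
# then hand the much smaller result over to Pandas for the rest of the work.
subset = (
    df.select("product_name", "countries_en", "energy_100g")
      .filter(df.countries_en == "France")
)
pandas_df = subset.toPandas()
```

The key point is that all the heavy lifting (reading, selecting, filtering) happens in Spark, and only the reduced dataset ever crosses into Pandas via toPandas().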
I took the following steps to set up my environment on Ubuntu:
- Install Anaconda
- Install Java OpenJDK 11: sudo apt-get install openjdk-11-jdk. The Java version is important, as Spark only works with Java 8 or 11
- Install Apache Spark (version 3.1.2…