On how to use Spark the right way when using the bootstrap method
It’s more complicated than you think.
If you’re here, there’s a small chance you found my article on one of my social media pages. More likely, you found out how hard it is to compute bootstrap resamples, and since you have access to Spark, you thought, "why not use it to take this to warp speed" (or some other, less nerdy notion of fast)? So you googled it.
Well, there are a few obstacles, but it’s possible. Let’s go over the steps of bootstrapping and how to avoid using Spark naively while calculating it.
The Naive Approach
First, the basics. I assume that you have a good grasp of what bootstrapping even is if you’re reading this article, but if that’s not the case, I would recommend Lorna Yen’s article at Towards Data Science:
Here, we just need to remember the basic steps:
- Take a sample from the population with sample size n.
- Draw a sample from the original sample data with replacement with size n, and replicate B times; each re-sampled sample is called a Bootstrap resample, and there will be B bootstrap resamples in total.
- Evaluate the statistic (mean, median, etc.) θ for each Bootstrap resample, and there will be B estimates of θ in total.
- Construct a sampling distribution with these B Bootstrap statistics and use it to make further statistical inference.
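To make the recipe concrete, here is what those four steps look like in plain Python, before any Spark enters the picture (a minimal sketch using NumPy; the sample, the number of resamples B, and the choice of the mean as the statistic are all placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: the original sample of size n (placeholder data)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Steps 2 and 3: draw B resamples with replacement and evaluate the statistic on each
B = 5_000
boot_stats = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(B)
]

# Step 4: use the B statistics as a sampling distribution, e.g. a 95% confidence interval
lower, upper = np.percentile(boot_stats, [2.5, 97.5])
```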
The important thing for us is the second step. The thing with Spark is that it works by partitioning the data into disjoint subsets and distributing them across the nodes in your cluster.
So, let’s say you’re trying to create bootstrap resamples from 1 million data points. In this case, we’re dealing with a very particular dataset: 999,999 zeroes and a single number one. Also, you’re using 50 partitions for this calculation. In that case, what would be the probability of a resample consisting of 1 million ones?
In real life, it’s astronomically small (1/10⁶ raised to the power of one million) but not zero. In naive Spark, it’s exactly zero. You see, only one of the partitions will contain the number one; all the others will only ever see zeroes. With 50 equal partitions of 20,000 points each, it’s impossible to get more than 20,000 ones; that extreme case happens only when the single partition holding the one resamples it every single time. Thus, a naive approach won’t produce all the possible resamples, and every resample needs to draw from the complete dataset to count as real bootstrapping.
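For concreteness, a naive resampling routine would look roughly like this: it resamples inside each partition, for example with `mapPartitions` (an illustrative sketch of what not to do; the function name is mine):

```python
import random

def naive_bootstrap_resample(rdd):
    """Resample with replacement *within* each partition.

    Each partition only ever sees its own rows, so a rare value can never
    account for more than its partition's share of the resample, which is
    exactly the problem described above.
    """
    def resample_partition(rows):
        rows = list(rows)
        if not rows:
            return []
        return [random.choice(rows) for _ in range(len(rows))]

    return rdd.mapPartitions(resample_partition)
```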
Bootstrapping with PySpark
So, how can we do it the right way? A good reference is Paul Cho’s blog post, Bootstrapping with spark.
In it, Paul explains the same issue described above and how to circumvent it, but with Scala code. Here, we’ll show how to do the same with PySpark.
The idea is to use Spark as a pure parallel execution engine: we broadcast the full dataset to every node and generate the bootstrap resamples in parallel, so that each task sees the complete data. To do that, we need a higher-order function that runs a statistical function in parallel on the broadcasted data.
First, we create an RDD (resilient distributed dataset) with the number of desired resamples. To each resample index, we map the statistical function we want to apply to the data. After that, we convert the RDD into a Spark DataFrame. Finally, we rename the columns to something more relevant than "_1" and "_2", and voilà, we have a Spark DataFrame with the calculated statistic for every resample, and we can proceed to the construction of the distribution.
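In code, that pipeline looks roughly like the following sketch, assuming `sc` is the active SparkContext, `data` is the full dataset as a Python list, `stat_fn` is the statistic we want, and `B` is the number of resamples (none of these names come from Paul’s post):

```python
# make the complete dataset available on every node
broadcast_data = sc.broadcast(data)

bootstrap_df = (
    sc.parallelize(range(B))                               # one element per resample index
      .map(lambda i: (i, stat_fn(broadcast_data.value)))   # evaluate the statistic on the full data
      .toDF()                                              # columns default to "_1" and "_2"
      .withColumnRenamed("_1", "resample_id")
      .withColumnRenamed("_2", "statistic")
)
```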
Below, we show an example of a function to get the mean of a bootstrap resample:
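(A minimal version, assuming the data arrives as a plain Python list of floats; the name `bootstrap_mean` is just illustrative.)

```python
import random

def bootstrap_mean(data):
    """Draw one bootstrap resample of `data` (with replacement) and return its mean."""
    total = 0.0
    for _ in range(len(data)):
        total += random.choice(data)  # one draw with replacement
    return total / len(data)
```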
In this function, we take a list of floating-point numbers; for each index of the list, we draw one random element from the list and add it to a running sum. At the end, we return the mean of this resample of the list.
The full code would look something like this:
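(Again a sketch, wiring the pieces above together with a local SparkSession and the toy dataset from the example; function names are illustrative.)

```python
import random

from pyspark.sql import SparkSession


def bootstrap_mean(data):
    """Draw one bootstrap resample of `data` (with replacement) and return its mean."""
    total = 0.0
    for _ in range(len(data)):
        total += random.choice(data)
    return total / len(data)


def run_bootstrap(spark, data, stat_fn, num_resamples):
    """Broadcast the full dataset and evaluate stat_fn on it num_resamples times in parallel."""
    sc = spark.sparkContext
    broadcast_data = sc.broadcast(data)
    return (
        sc.parallelize(range(num_resamples))
          .map(lambda i: (i, stat_fn(broadcast_data.value)))
          .toDF()
          .withColumnRenamed("_1", "resample_id")
          .withColumnRenamed("_2", "statistic")
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("bootstrap-example").getOrCreate()

    # toy dataset: 999,999 zeroes and a single one, as in the example above
    data = [0.0] * 999_999 + [1.0]

    bootstrap_df = run_bootstrap(spark, data, bootstrap_mean, num_resamples=1_000)
    bootstrap_df.show(5)

    spark.stop()
```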
Final Notes
This approach will not be as fast as the naive Spark resampling, but it will be faster than plain Python, especially for big datasets.
Also, Paul Cho proposes another set of functions to compress datasets with millions of data points; that takes a toll on performance, but it allows Spark to broadcast those datasets without memory issues.
For our purposes, the functions described above suit just fine, but we can get back to translating Paul’s Scala code to PySpark in the future.