
So you want to bootstrap with PySpark

It's a little more complicated than you may think at first…

On how to use Spark the right way when using the bootstrap method

Photo by Joshua Chai on Unsplash

It’s more complicated than you think.

If you’re here, there’s a small chance that you’ve found my article on one of my social media pages. More likely, you found out how hard it is to compute bootstrap resamples, and since you have access to Spark, you thought, "why not use it to kick this up to warp speed" (or some other less nerdy concept of fast)? So, you’ve googled it.

Well, there are a few obstacles, but it’s possible. Let’s discuss the steps of bootstrapping and how to avoid using Spark naively while calculating it.

The Naive Approach

First, the basics. I assume that you have a good grasp of what bootstrapping is if you’re reading this article, but if that’s not the case, I would recommend Lorna Yen’s article at Towards Data Science:

An Introduction to the Bootstrap Method

Here, we just need to remember the basic steps:

Basic steps of the bootstrap method
  1. Take a sample from the population with sample size n.
  2. Draw a sample from the original sample data with replacement with size n, and replicate B times; each re-sampled sample is called a Bootstrap resample, and there will be B bootstrap resamples in total.
  3. Evaluate the statistic (mean, median, etc.) θ for each Bootstrap resample, and there will be B estimates of θ in total.
  4. Construct a sampling distribution with these B bootstrap statistics and use it to make further statistical inference (a plain-Python sketch of these steps follows below).
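As a quick warm-up, here is a minimal plain-Python sketch of steps 2 through 4. The function name, the default B, and the choice of the mean as the statistic are my own illustrative assumptions, not part of any library:

import random
import statistics

def bootstrap_distribution(original_sample, B=1000, statistic=statistics.mean):
    n = len(original_sample)
    estimates = []
    for _ in range(B):
        # Step 2: draw n points with replacement from the original sample
        resample = [random.choice(original_sample) for _ in range(n)]
        # Step 3: evaluate the statistic on this bootstrap resample
        estimates.append(statistic(resample))
    # Step 4: the B estimates form the sampling distribution of the statistic
    return estimates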

The important thing for us is the second step. The thing with Spark is that it works by partitioning the data into disjoint subsets and distributing them across the nodes in your cluster.

So, let’s say you’re trying to create bootstrap resamples from 1 million data points. In this case, we’re dealing with a very particular dataset: 999,999 zeroes and only a single number 1. Also, you’re using 50 partitions for this calculation. In that case, what would be the probability of a resample consisting of 1 million number ones?

1 million data points distributed across 50 partitions; only a single 1, and all the rest are 0

In real life, (1/10⁶) raised to the power of one million: vanishingly small, but not zero. In naive Spark, exactly zero. You see, only one of these partitions will have the number one; all the others will only see zeroes. It’ll be impossible to have more than 20,000 ones; that would be the specific case in which the only partition out of 50 that holds the 1 returns nothing but ones. Thus, a naive approach won’t return all the possible resamples, because every resampling needs the complete dataset to count as real bootstrapping.
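To make this concrete, here is a hedged sketch of what the naive attempt might look like; the DataFrame name and column are illustrative assumptions. Spark’s sample with replacement draws each output row from the partition that row lives in, so a partition holding only zeroes can never contribute a 1:

# Hypothetical naive attempt: resample a DataFrame of 1 million rows in place.
# With replacement, sampling happens partition by partition, so rows never
# cross partitions: the 49 all-zero partitions can only ever yield zeroes.
naive_resample = df.sample(withReplacement=True, fraction=1.0)

# Note also that fraction=1.0 returns *approximately* n rows, not exactly n.
naive_mean = naive_resample.agg({"value": "mean"}).collect()[0][0]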

Bootstrapping with PySpark

So, how can we do it the right way? A good reference is Paul Cho’s blog post, Bootstrapping with Spark:

Bootstrapping with Spark

In it, Paul explains the same issue I’ve described above and how to circumvent it, but with code in Scala. Here, we’ll show how to do the same with PySpark.

We have to use Spark as a parallel execution engine so that we can generate the bootstrap resamples in parallel. That will allow us to broadcast the full dataset to every single node. To do that, we need to create a higher-order function to help us run a statistical function in parallel on this broadcasted data.

from pyspark.sql import SparkSession

# Reuse an existing session (or create one) so we can reach the SparkContext
spark = SparkSession.builder.getOrCreate()

def getParallelBootstrap(data, function, num_samples):
    # One RDD element per desired resample
    rdd = spark.sparkContext.parallelize(list(range(1, num_samples + 1)))
    # Map each resample index to the statistic computed on a fresh resample of data
    df = rdd \
        .map(lambda x: (x, function(data))) \
        .toDF() \
        .withColumnRenamed('_1', 'sample') \
        .withColumnRenamed('_2', 'sample_agg')
    return df

First, we create an RDD (resilient distributed dataset) with one element per desired resample. To each resample index, we map the statistical function we want to apply to the data. After that, we convert the RDD into a Spark DataFrame. Finally, we rename the columns to something more relevant than "_1" and "_2", and voilà, we have a Spark DataFrame with the calculated statistic for every resample, and we can proceed to the construction of the distribution.
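In the snippet above, the data travels to the executors inside the lambda’s closure. For larger lists, one option is an explicit broadcast variable, which ships the dataset once per executor rather than once per task. A minimal sketch, with the function name being my own assumption:

def getParallelBootstrapBroadcast(data, function, num_samples):
    # Ship the full dataset to every executor once
    broadcast_data = spark.sparkContext.broadcast(data)
    rdd = spark.sparkContext.parallelize(range(1, num_samples + 1))
    df = rdd \
        .map(lambda x: (x, function(broadcast_data.value))) \
        .toDF() \
        .withColumnRenamed('_1', 'sample') \
        .withColumnRenamed('_2', 'sample_agg')
    return df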

Below, we show an example of a function that returns the mean of a bootstrap resample:

import random as rd

def getBootstrapedMean(data: list[float]):
    number_points = len(data)
    sum_tally = 0
    # Draw number_points values with replacement and accumulate their sum
    for inx in range(number_points):
        draw = data[rd.randint(0, number_points - 1)]
        sum_tally = sum_tally + draw
    return sum_tally / number_points

In this function, we take a list of floating-point numbers; for each index of the list, we draw a random element from the list with replacement and add it to a running sum. In the end, we return the mean of this resample of the list.

The full code would look something like this:

from pyspark.sql import SparkSession
import random as rd

spark = SparkSession.builder.getOrCreate()

def getParallelBootstrap(data, function, num_samples):
    rdd = spark.sparkContext.parallelize(list(range(1, num_samples + 1)))
    df = rdd \
        .map(lambda x: (x, function(data))) \
        .toDF() \
        .withColumnRenamed('_1', 'sample') \
        .withColumnRenamed('_2', 'sample_agg')
    return df

def getBootstrapedMean(data: list[float]):
    number_points = len(data)
    sum_tally = 0
    for inx in range(number_points):
        draw = data[rd.randint(0, number_points - 1)]
        sum_tally = sum_tally + draw
    return sum_tally / number_points

# my_data = some list of floating numbers
# 10,000 resamples:
bootstrapp_samples = getParallelBootstrap(my_data, getBootstrapedMean, 10000)
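From here, one way to make the inference step (step 4) concrete is to pull a confidence interval straight out of the resulting DataFrame. A hedged sketch, assuming the column names from the code above and a 95% interval:

# Approximate 2.5th and 97.5th percentiles of the 10,000 resample means
lower, upper = bootstrapp_samples.approxQuantile('sample_agg', [0.025, 0.975], 0.0)
print(f"95% bootstrap confidence interval for the mean: [{lower:.4f}, {upper:.4f}]")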

Final Notes

This will not be as fast as naive Spark, but it will be faster than plain Python, especially for big sets of data.

Also, Paul Cho proposed another set of functions to compress datasets with millions of data points, which takes a toll on performance but allows Spark to broadcast those datasets without memory issues.

For our purposes, the functions I’ve described above suit us just fine, but we can get back to translating Paul’s Scala code to PySpark in the future.

