On how to use Spark the right way when using the bootstrap method
It’s more complicated than you think.
If you’re here, there’s a small chance you found my article on one of my social media pages. More likely, you found out how hard it is to compute bootstrap resamples, and since you have access to Spark, you thought, "why not use it to take this to warp speed" (or some other, less nerdy notion of fast)? So you googled it.
Well, there are a few obstacles, but it’s possible. Let’s go over the steps of bootstrapping and how to avoid using Spark naively while calculating it.
The Naive Approach
First, the basics. I assume that you have a good grasp of what bootstrapping even is if you’re reading this article, but if that’s not the case, I would recommend Lorna Yen’s article at Towards Data Science:
Here, we just need to remember the basic steps:
- Take a sample from the population with sample size n.
- Draw a sample from the original sample data with replacement with size n, and replicate B times; each re-sampled sample is called a Bootstrap resample, and there will be B bootstrap resamples in total.
- Evaluate the statistic (mean, median, etc.) θ for each Bootstrap resample, and there will be B estimates of θ in total.
- Construct a sampling distribution with these B Bootstrap statistics and use it to make further statistical inference.
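To make the recipe concrete, here is what those four steps look like in plain Python, before any Spark enters the picture (a minimal sketch using NumPy; the sample, the number of resamples B, and the choice of the mean as the statistic are all placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: the original sample of size n (placeholder data)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Steps 2 and 3: draw B resamples with replacement and evaluate the statistic on each
B = 5_000
boot_stats = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(B)
]

# Step 4: use the B statistics as a sampling distribution, e.g. a 95% confidence interval
lower, upper = np.percentile(boot_stats, [2.5, 97.5])
```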
The important thing for us is the second step. The thing with Spark is that it works by partitioning the data into disjoint subsets and distributing them across the nodes in your cluster.
So, let’s say you’re trying to create bootstrap resamples from 1 million data points. In this case, we’re dealing with a very particular dataset: 999,999 zeroes and a single number one. Also, you’re using 50 partitions for this calculation. In that case, what would be the probability of a resample consisting of 1 million ones?
In real life, it’s astronomically small (1/10⁶ raised to the power of one million) but not zero. In naive Spark, it’s exactly zero. You see, only one of the partitions will contain the number one; all the others will only ever see zeroes. With 50 equal partitions of 20,000 points each, it’s impossible to get more than 20,000 ones; that extreme case happens only when the single partition holding the one resamples it every single time. Thus, a naive approach won’t produce all the possible resamples, and every resample needs to draw from the complete dataset to count as real bootstrapping.
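For concreteness, a naive resampling routine would look roughly like this: it resamples inside each partition, for example with `mapPartitions` (an illustrative sketch of what not to do; the function name is mine):

```python
import random

def naive_bootstrap_resample(rdd):
    """Resample with replacement *within* each partition.

    Each partition only ever sees its own rows, so a rare value can never
    account for more than its partition's share of the resample, which is
    exactly the problem described above.
    """
    def resample_partition(rows):
        rows = list(rows)
        if not rows:
            return []
        return [random.choice(rows) for _ in range(len(rows))]

    return rdd.mapPartitions(resample_partition)
```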
Bootstrapping with PySpark
So, how can we do it the right way? A good reference is Paul Cho’s blog post, Bootstrapping with spark.
In it, Paul explains the same issue described above and how to circumvent it, but with Scala code. Here, we’ll show how to do the same with PySpark.
The idea is to use Spark as a pure parallel execution engine: we broadcast the full dataset to every node and generate the bootstrap resamples in parallel, so that each task sees the complete data. To do that, we need a higher-order function that runs a statistical function in parallel on the broadcasted data.
First, we create an RDD (resilient distributed dataset) with the number of desired resamples. To each resample index, we map the statistical function we want to apply to the data. After that, we convert the RDD into a Spark DataFrame. Finally, we rename the columns to something more relevant than "_1" and "_2", and voilà, we have a Spark DataFrame with the calculated statistic for every resample, and we can proceed to the construction of the distribution.
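In code, that pipeline looks roughly like the following sketch, assuming `sc` is the active SparkContext, `data` is the full dataset as a Python list, `stat_fn` is the statistic we want, and `B` is the number of resamples (none of these names come from Paul’s post):

```python
# make the complete dataset available on every node
broadcast_data = sc.broadcast(data)

bootstrap_df = (
    sc.parallelize(range(B))                               # one element per resample index
      .map(lambda i: (i, stat_fn(broadcast_data.value)))   # evaluate the statistic on the full data
      .toDF()                                              # columns default to "_1" and "_2"
      .withColumnRenamed("_1", "resample_id")
      .withColumnRenamed("_2", "statistic")
)
```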
Below, we show an example of a function to get the mean of a bootstrap resample:
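(A minimal version, assuming the data arrives as a plain Python list of floats; the name `bootstrap_mean` is just illustrative.)

```python
import random

def bootstrap_mean(data):
    """Draw one bootstrap resample of `data` (with replacement) and return its mean."""
    total = 0.0
    for _ in range(len(data)):
        total += random.choice(data)  # one draw with replacement
    return total / len(data)
```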
In this function, we take a list of floating-point numbers; for each index of the list, we draw one random element from the list and add it to a running sum. At the end, we return the mean of this resample of the list.
The full code would look something like this:
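(Again a sketch, wiring the pieces above together with a local SparkSession and the toy dataset from the example; function names are illustrative.)

```python
import random

from pyspark.sql import SparkSession


def bootstrap_mean(data):
    """Draw one bootstrap resample of `data` (with replacement) and return its mean."""
    total = 0.0
    for _ in range(len(data)):
        total += random.choice(data)
    return total / len(data)


def run_bootstrap(spark, data, stat_fn, num_resamples):
    """Broadcast the full dataset and evaluate stat_fn on it num_resamples times in parallel."""
    sc = spark.sparkContext
    broadcast_data = sc.broadcast(data)
    return (
        sc.parallelize(range(num_resamples))
          .map(lambda i: (i, stat_fn(broadcast_data.value)))
          .toDF()
          .withColumnRenamed("_1", "resample_id")
          .withColumnRenamed("_2", "statistic")
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("bootstrap-example").getOrCreate()

    # toy dataset: 999,999 zeroes and a single one, as in the example above
    data = [0.0] * 999_999 + [1.0]

    bootstrap_df = run_bootstrap(spark, data, bootstrap_mean, num_resamples=1_000)
    bootstrap_df.show(5)

    spark.stop()
```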
Final Notes
This approach will not be as fast as the naive Spark resampling, but it will be faster than plain Python, especially for big datasets.
Also, Paul Cho proposes another set of functions to compress datasets with millions of data points; that takes a toll on performance, but it allows Spark to broadcast those datasets without memory issues.
For our purposes, the functions described above suit just fine, but we can get back to translating Paul’s Scala code to PySpark in the future.