DS Algorithms

The 5 Sampling Algorithms every Data Scientist needs to know

Or at least should have heard about

Rahul Agarwal
Towards Data Science
5 min read · Jul 21, 2019

Data Science is the study of algorithms.

I grapple with many algorithms on a day-to-day basis, so in this new DS Algorithm series I thought I would list some of the most common and most useful ones.

This post is about some of the most common sampling techniques one can use while working with data.

Simple Random Sampling

Say you want to select a subset of a population in which each member of the subset has an equal probability of being chosen.

Below we select 100 sample points from a dataset.

# Randomly sample 100 rows from the DataFrame (without replacement by default)
sample_df = df.sample(100)
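
If you want a self-contained, reproducible version, here is a minimal sketch assuming a toy pandas DataFrame named df; the column names are made up for illustration and random_state just fixes the draw:

import pandas as pd
import numpy as np

# Hypothetical population of 10,000 rows (columns are illustrative only)
df = pd.DataFrame({"id": np.arange(10_000), "value": np.random.randn(10_000)})

# Simple random sample of 100 rows, each row equally likely to be chosen
sample_df = df.sample(n=100, random_state=42)
print(sample_df.shape)  # (100, 2)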

Stratified Sampling

Assume that we need to estimate the average number of votes for each candidate in an election. Assume that the country has 3 towns:

Town A has 1 million factory workers,

Town B has 2 million workers, and

Town C has 3 million retirees.

We could draw a single random sample of size 60 over the entire population, but there is some chance that it ends up poorly balanced across these towns and is therefore biased, causing a significant error in the estimate.

Instead, if we take random samples of 10, 20 and 30 from Towns A, B and C respectively, we get a smaller estimation error for the same total sample size.

You can do something like this pretty easily with Python:

from sklearn.model_selection import train_test_split

# Stratify the split on y so that train and test sets preserve the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25
)
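
To mirror the town example directly, here is a minimal sketch assuming a hypothetical DataFrame with a town column; it keeps the 1:2:3 ratio from above and samples 10% of each stratum with pandas:

import pandas as pd

# Hypothetical population, scaled down but keeping the 1:2:3 town ratio
population = pd.DataFrame({
    "town": ["A"] * 100 + ["B"] * 200 + ["C"] * 300,
    "votes": range(600),
})

# Sample 10% of each town, i.e. 10, 20 and 30 rows from A, B and C respectively
stratified = population.groupby("town", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)
print(stratified["town"].value_counts())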

Reservoir Sampling
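
Reservoir sampling lets you pick k items uniformly at random from a stream whose length is unknown or too large to hold in memory, in a single pass. Here is a minimal sketch of the classic Algorithm R; the function name and the example stream are illustrative and not taken from the original post:

import random

def reservoir_sample(stream, k):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: draw 5 numbers uniformly from a stream of 100,000 integers
print(reservoir_sample(range(100_000), 5))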
