DS Algorithms
The 5 Sampling Algorithms every Data Scientist need to know
Or at least should have heard about
Data Science is the study of algorithms.
I grapple through with many algorithms on a day to day basis so I thought of listing some of the most common and most used algorithms one will end up using in this new DS Algorithm series.
This post is about some of the most common sampling techniques one can use while working with data.
Simple Random Sampling
Say you want to select a subset of a population in which each member of the subset has an equal probability of being chosen.
Below we select 100 sample points from a dataset.
sample_df = df.sample(100)
Stratified Sampling
Assume that we need to estimate the average number of votes for each candidate in an election. Assume that the country has 3 towns:
Town A has 1 million factory workers,
Town B has 2 million workers, and
Town C has 3 million retirees.
We can choose to get a random sample of size 60 over the entire population but there is some chance that the random sample turns out to be not well balanced across these towns and hence is biased causing a significant error in estimation.
Instead, if we choose to take a random sample of 10, 20 and 30 from Town A, B and C respectively then we can produce a smaller error in estimation for the same total size of the sample.
You can do something like this pretty easily with Python:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)