DS Algorithms

The 5 Sampling Algorithms every Data Scientist needs to know

Or at least should have heard about

Rahul Agarwal

Published in

Towards Data Science

5 min read · Jul 21, 2019

Data Science is the study of algorithms.

I grapple with many algorithms on a day-to-day basis, so I thought of listing some of the most common and most used ones in this new DS Algorithms series.

This post is about some of the most common sampling techniques one can use while working with data.

Simple Random Sampling

Say you want to select a subset of a population in which each member of the population has an equal probability of being chosen.

Below we select 100 sample points from a dataset held in a pandas DataFrame df.

sample_df = df.sample(100)
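Here is a slightly fuller sketch of the same call. The toy df and its value column are made up for illustration and are not from the original dataset. Passing random_state makes the draw reproducible, and replace=False (the pandas default) means no row is picked twice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 1,000 rows with a single numeric column.
df = pd.DataFrame({"value": np.arange(1000)})

# Draw 100 rows uniformly at random, without replacement.
# random_state fixes the seed so the same rows come back on every run.
sample_df = df.sample(n=100, replace=False, random_state=42)

# Alternatively, sample a fraction of the rows instead of a fixed count.
sample_frac_df = df.sample(frac=0.1, random_state=42)

print(len(sample_df), len(sample_frac_df))  # 100 100
```

Fixing the seed is mostly useful while debugging or writing tests; drop random_state if you want a fresh sample on every run.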

Stratified Sampling

Assume that we need to estimate the average number of votes for each candidate in an election. Assume that the country has 3 towns:

Town A has 1 million factory workers,

Town B has 2 million workers, and

Town C has 3 million retirees.

We can choose to take a random sample of size 60 over the entire population, but there is some chance that the sample will not be well balanced across these towns and will therefore be biased, causing a significant error in the estimate.

Instead, if we take random samples of 10, 20 and 30 from Towns A, B and C respectively, in proportion to their populations, we can produce a smaller error in estimation for the same total sample size.

You can do something like this pretty easily with Python, for example by stratifying a train/test split on the labels y:

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
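The train_test_split call stratifies a train/test split. To draw the stratified sample from the town example directly, one possible sketch with pandas is shown below. It assumes the population sits in a DataFrame with a town column, and it scales the towns down from millions to thousands of rows so it runs quickly. The names population, sizes and stratified are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the population: one row per person, scaled down
# so that Towns A, B and C have 1,000, 2,000 and 3,000 rows (ratio 1:2:3).
population = pd.DataFrame(
    {"town": ["A"] * 1000 + ["B"] * 2000 + ["C"] * 3000}
)

# Per-stratum sample sizes from the example: 10, 20 and 30 (60 in total).
sizes = {"A": 10, "B": 20, "C": 30}

# Take a simple random sample inside each town, then stack the pieces.
parts = [
    group.sample(n=sizes[town], random_state=0)
    for town, group in population.groupby("town")
]
stratified = pd.concat(parts)

print(stratified["town"].value_counts().sort_index().to_dict())
# {'A': 10, 'B': 20, 'C': 30}
```

Unlike a simple random sample of 60, this draw always contains exactly 10, 20 and 30 rows from the three towns, so no town can end up over- or under-represented by chance.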
