
BM25S — Efficacy Improvement of BM25 Algorithm in Document Retrieval

bm25s, a Python implementation of the BM25 algorithm, uses SciPy to speed up document retrieval

Chien Vu
TDS Archive
10 min read · Aug 12, 2024


Background of the BM25 algorithm

BM25, short for Best Matching 25, is a popular bag-of-words ranking function for document retrieval. It aims to return relevant search results by scoring each document against a query, based on how often the query terms occur in the document and on the document's length.

BM25 builds on term frequency (TF) and inverse document frequency (IDF), the two components at the core of TF-IDF.
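To make the role of term frequency, inverse document frequency, and document length concrete, here is a minimal pure-Python sketch of a BM25 scorer. It is an illustration under stated assumptions, not the bm25s implementation: the function name `bm25_scores`, the whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the non-negative (Lucene-style) IDF variant are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` (hypothetical helper).

    Assumes a non-empty corpus with at least one non-empty document.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Non-negative IDF variant (as used by Lucene).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating term-frequency component.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = bm25_scores("cat", corpus)
```

In this toy corpus the first two documents each contain "cat" once, so the shorter second document receives the higher score, and the third document scores zero because it does not contain the query term.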

First, let’s take a quick look at the TF-IDF formula.

TF-IDF formula (Image by author)
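The formula image is not reproduced in this text. A commonly used formulation, assumed here and not necessarily identical to the author's figure, is shown below, where $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents containing $t$:

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$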

In TF-IDF, the importance of a word increases in proportion to the number of times it appears in a document, but is offset by how frequent the word is across the corpus. The first part, term frequency (TF), indicates how often a term appears in a specific document. The more frequently a term appears within a document, the more likely it is to be significant. However, it is normalized by the total number…
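As a rough illustration of this weighting, the sketch below computes TF-IDF for a toy corpus. It assumes one common variant (term count divided by document length, multiplied by the natural log of corpus size over document frequency) and a naive whitespace tokenizer. The function name `tf_idf` and the sample sentences are made up for this example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # TF (count normalized by document length) times IDF.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(corpus)
```

Here "the" appears in every document, so its IDF is log(1) = 0 and its weight is zero even though it is the most frequent word in the first document. "mat" appears in only one document and therefore outweighs "cat", which appears in two.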


Written by Chien Vu

PhD, Researcher | I love to connect and share | LinkedIn Top Voice https://www.linkedin.com/in/vumichien/
