Session 11· 04· 25 min

Keyword Search with BM25

What you'll learn
  • Understand term frequency and inverse document frequency (IDF)
  • Build a BM25 index from scratch in pure Python
  • Know when BM25 outperforms semantic search and when it loses

Tokenisation

bm25.py
import re
from math import log
from collections import Counter

def tokenise(text: str) -> list[str]:
    return re.findall(r'[a-z0-9_]+', text.lower())

IDF formula

bm25.py
# IDF(t) = log((N - df + 0.5) / (df + 0.5) + 1)
# N  = total number of documents
# df = number of documents containing term t

Full BM25Index class

bm25.py
class BM25Index:
    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.chunks: list[str] = []
        self.tf: list[Counter] = []
        self.idf: dict[str, float] = {}
        self.avg_dl: float = 0.0

    def add(self, chunks: list[str]) -> None:
        self.chunks = chunks
        tokenised = [tokenise(c) for c in chunks]
        self.tf = [Counter(t) for t in tokenised]
        N = len(chunks)
        self.avg_dl = sum(len(t) for t in tokenised) / max(N, 1)
        df: Counter = Counter()
        for tokens in tokenised:
            for term in set(tokens):
                df[term] += 1
        self.idf = {
            term: log((N - freq + 0.5) / (freq + 0.5) + 1)
            for term, freq in df.items()
        }

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q_tokens = tokenise(query)
        scores = []
        for i, chunk in enumerate(self.chunks):
            dl = sum(self.tf[i].values())
            score = 0.0
            for term in q_tokens:
                if term not in self.idf:
                    continue
                tf_val = self.tf[i].get(term, 0)
                num = tf_val * (self.k1 + 1)
                denom = tf_val + self.k1 * (1 - self.b + self.b * dl / self.avg_dl)
                score += self.idf[term] * num / denom
            scores.append((score, chunk))
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[:top_k]
k1 controls term frequency saturation — higher k1 means TF keeps growing with repetition
b controls document length normalisation — 0 disables it, 1 full normalisation
tf stores per-document token counts; idf stores corpus-level term rarity weights
avg_dl is the average document length used for length normalisation in BM25
The inner loop accumulates BM25 score for each query term found in the chunk

BM25 vs semantic search

BM25 wins
exact token queries
  • Error codes: "E_DEADLOCK_0x8F3"
  • Function names: "numpy.einsum"
  • Product SKUs and identifiers
  • Rare proper nouns and abbreviations
  • Queries where the exact word must appear
Semantic wins
concept queries
  • Paraphrase queries: "how do models learn?"
  • Cross-lingual or synonym-rich queries
  • Questions where the answer uses different words
  • Short or ambiguous queries
  • Domain jargon not in training data
Default BM25 parameters
k1=1.5 and b=0.75 are the standard defaults used in most academic benchmarks. Only tune them if you have a large eval dataset to measure against.
Knowledge Check
A user queries "AttributeError: 'NoneType' object has no attribute 'split'". Which retriever is more likely to find the relevant documentation chunk?
Recap — what you just learned
  • BM25 scores documents by term frequency weighted by inverse document frequency
  • IDF gives high weight to rare terms and near-zero weight to common words
  • k1=1.5 and b=0.75 are reliable defaults; only tune with measured eval data
  • BM25 beats semantic search for exact-token queries: error codes, function names, identifiers
  • The two methods are complementary — combining them (hybrid retrieval) is next
Next up: Hybrid Retrieval with RRF