Session 11· 04· 25 min

Keyword Search with BM25

What you'll learn

▸Understand term frequency and inverse document frequency (IDF)
▸Build a BM25 index from scratch in pure Python
▸Know when BM25 outperforms semantic search and when it loses

Tokenisation

bm25.py

import re
from math import log
from collections import Counter

def tokenise(text: str) -> list[str]:
    return re.findall(r'[a-z0-9_]+', text.lower())

IDF formula

bm25.py

# IDF(t) = log((N - df + 0.5) / (df + 0.5) + 1)
# N  = total number of documents
# df = number of documents containing term t

Full BM25Index class

bm25.py

class BM25Index:
    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.chunks: list[str] = []
        self.tf: list[Counter] = []
        self.idf: dict[str, float] = {}
        self.avg_dl: float = 0.0

    def add(self, chunks: list[str]) -> None:
        self.chunks = chunks
        tokenised = [tokenise(c) for c in chunks]
        self.tf = [Counter(t) for t in tokenised]
        N = len(chunks)
        self.avg_dl = sum(len(t) for t in tokenised) / max(N, 1)
        df: Counter = Counter()
        for tokens in tokenised:
            for term in set(tokens):
                df[term] += 1
        self.idf = {
            term: log((N - freq + 0.5) / (freq + 0.5) + 1)
            for term, freq in df.items()
        }

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q_tokens = tokenise(query)
        scores = []
        for i, chunk in enumerate(self.chunks):
            dl = sum(self.tf[i].values())
            score = 0.0
            for term in q_tokens:
                if term not in self.idf:
                    continue
                tf_val = self.tf[i].get(term, 0)
                num = tf_val * (self.k1 + 1)
                denom = tf_val + self.k1 * (1 - self.b + self.b * dl / self.avg_dl)
                score += self.idf[term] * num / denom
            scores.append((score, chunk))
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[:top_k]

①k1 controls term frequency saturation — higher k1 means TF keeps growing with repetition

②b controls document length normalisation — 0 disables it, 1 full normalisation

③tf stores per-document token counts; idf stores corpus-level term rarity weights

④avg_dl is the average document length used for length normalisation in BM25

⑤The inner loop accumulates BM25 score for each query term found in the chunk

BM25 vs semantic search

BM25 wins

exact token queries

•Error codes: "E_DEADLOCK_0x8F3"
•Function names: "numpy.einsum"
•Product SKUs and identifiers
•Rare proper nouns and abbreviations
•Queries where the exact word must appear

Semantic wins

concept queries

•Paraphrase queries: "how do models learn?"
•Cross-lingual or synonym-rich queries
•Questions where the answer uses different words
•Short or ambiguous queries
•Domain jargon not in training data

Default BM25 parameters

k1=1.5 and b=0.75 are the standard defaults used in most academic benchmarks. Only tune them if you have a large eval dataset to measure against.

Knowledge Check

A user queries "AttributeError: 'NoneType' object has no attribute 'split'". Which retriever is more likely to find the relevant documentation chunk?

Recap — what you just learned

✓BM25 scores documents by term frequency weighted by inverse document frequency
✓IDF gives high weight to rare terms and near-zero weight to common words
✓k1=1.5 and b=0.75 are reliable defaults; only tune with measured eval data
✓BM25 beats semantic search for exact-token queries: error codes, function names, identifiers
✓The two methods are complementary — combining them (hybrid retrieval) is next

Next up: Hybrid Retrieval with RRF