Session 11· 04· 25 min
Keyword Search with BM25
What you'll learn
- ▸Understand term frequency and inverse document frequency (IDF)
- ▸Build a BM25 index from scratch in pure Python
- ▸Know when BM25 outperforms semantic search and when it loses
Tokenisation
bm25.py
import re
from math import log
from collections import Counter
def tokenise(text: str) -> list[str]:
return re.findall(r'[a-z0-9_]+', text.lower())IDF formula
bm25.py
# IDF(t) = log((N - df + 0.5) / (df + 0.5) + 1)
# N = total number of documents
# df = number of documents containing term tFull BM25Index class
bm25.py
class BM25Index:
def __init__(self, k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.chunks: list[str] = []
self.tf: list[Counter] = []
self.idf: dict[str, float] = {}
self.avg_dl: float = 0.0
def add(self, chunks: list[str]) -> None:
self.chunks = chunks
tokenised = [tokenise(c) for c in chunks]
self.tf = [Counter(t) for t in tokenised]
N = len(chunks)
self.avg_dl = sum(len(t) for t in tokenised) / max(N, 1)
df: Counter = Counter()
for tokens in tokenised:
for term in set(tokens):
df[term] += 1
self.idf = {
term: log((N - freq + 0.5) / (freq + 0.5) + 1)
for term, freq in df.items()
}
def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
q_tokens = tokenise(query)
scores = []
for i, chunk in enumerate(self.chunks):
dl = sum(self.tf[i].values())
score = 0.0
for term in q_tokens:
if term not in self.idf:
continue
tf_val = self.tf[i].get(term, 0)
num = tf_val * (self.k1 + 1)
denom = tf_val + self.k1 * (1 - self.b + self.b * dl / self.avg_dl)
score += self.idf[term] * num / denom
scores.append((score, chunk))
scores.sort(key=lambda x: x[0], reverse=True)
return scores[:top_k]①k1 controls term frequency saturation — higher k1 means TF keeps growing with repetition
②b controls document length normalisation — 0 disables it, 1 full normalisation
③tf stores per-document token counts; idf stores corpus-level term rarity weights
④avg_dl is the average document length used for length normalisation in BM25
⑤The inner loop accumulates BM25 score for each query term found in the chunk
BM25 vs semantic search
BM25 wins
exact token queries
- •Error codes: "E_DEADLOCK_0x8F3"
- •Function names: "numpy.einsum"
- •Product SKUs and identifiers
- •Rare proper nouns and abbreviations
- •Queries where the exact word must appear
Semantic wins
concept queries
- •Paraphrase queries: "how do models learn?"
- •Cross-lingual or synonym-rich queries
- •Questions where the answer uses different words
- •Short or ambiguous queries
- •Domain jargon not in training data
Default BM25 parameters
k1=1.5 and b=0.75 are the standard defaults used in most academic benchmarks. Only tune them if you have a large eval dataset to measure against.
Knowledge Check
A user queries "AttributeError: 'NoneType' object has no attribute 'split'". Which retriever is more likely to find the relevant documentation chunk?
Recap — what you just learned
- ✓BM25 scores documents by term frequency weighted by inverse document frequency
- ✓IDF gives high weight to rare terms and near-zero weight to common words
- ✓k1=1.5 and b=0.75 are reliable defaults; only tune with measured eval data
- ✓BM25 beats semantic search for exact-token queries: error codes, function names, identifiers
- ✓The two methods are complementary — combining them (hybrid retrieval) is next