Session 11 · 03 · 25 min
Semantic Search
What you'll learn
- Build a vector index from scratch using chunking and embeddings
- Perform semantic search over document chunks
- Identify when semantic search fails and why
VectorIndex class
vector_index.py
# generate_embedding() and cosine_similarity() are assumed to be defined
# elsewhere in the course code; they are not shown in this lesson.
class VectorIndex:
    def __init__(self):
        # Parallel lists: chunks[i] is the text embedded as vectors[i].
        self.chunks: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, chunks: list[str]) -> None:
        # Embed at insert time so that search() only has to embed the query.
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(generate_embedding(chunk))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q_vec = generate_embedding(query)
        # Score the query against every stored vector (a linear scan).
        scores = [
            (cosine_similarity(q_vec, v), c)
            for v, c in zip(self.vectors, self.chunks)
        ]
        # Highest similarity first.
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[:top_k]

① chunks and vectors are parallel lists: index i always corresponds to the same piece of text
② add() embeds each chunk immediately, so search() only embeds the query
③ search() computes cosine similarity against every stored vector, which is fine for fewer than about 10k chunks
④ Results are sorted in descending order, so index 0 is always the best match
⑤ top_k=3 is a sensible default; returning more chunks gives more context but a higher token cost
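The class above calls generate_embedding() and cosine_similarity() without showing them. The sketch below fills both in so the index can be run end to end. cosine_similarity() is the standard formula. generate_embedding() here is a hypothetical stand-in (a toy hashed bag-of-words with an assumed dimension of 4096), not a real embedding model, so it only captures word overlap and not meaning. The sample chunks are made up for illustration.

```python
import math
import zlib


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def generate_embedding(text: str, dim: int = 4096) -> list[float]:
    """HYPOTHETICAL stand-in: hashed bag-of-words, not a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


class VectorIndex:
    """Same class as in the lesson, repeated so this block runs on its own."""

    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(generate_embedding(chunk))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q_vec = generate_embedding(query)
        scores = [
            (cosine_similarity(q_vec, v), c)
            for v, c in zip(self.vectors, self.chunks)
        ]
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[:top_k]


# Made-up sample chunks, for illustration only.
index = VectorIndex()
index.add([
    "reset your password from the account settings page",
    "invoices are emailed on the first day of each month",
    "the api rate limit is one hundred requests per minute",
])

results = index.search("how do i reset my password", top_k=2)
best_score, best_chunk = results[0]
print(f"{best_score:.3f}  {best_chunk}")
```

With the toy embedding, the password chunk ranks first only because it shares the words "reset" and "password" with the query. A real embedding model is what would let a query such as "I forgot my login" match the same chunk.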
When semantic search fails
vector_index.py
# This query will likely fail: "E_DEADLOCK_0x8F3" is an exact token
results = index.search('What causes error E_DEADLOCK_0x8F3?')
# The vector for this query has no meaningful relationship
# to chunks that contain the literal string "E_DEADLOCK_0x8F3"

Semantic search blind spots
Error codes, function names, product identifiers, and rare proper nouns are poorly served by semantic search. The embedding model has typically seen too few examples of such tokens to place them in a meaningful region of vector space, so the query vector tends to reflect the surrounding words ("what causes error") more than the identifier itself. For these queries, keyword search (BM25) is much more effective; it is covered in lesson 04.
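A literal-match check is one way to see why keyword matching helps here. The sketch below is not BM25, which lesson 04 covers. It is a deliberately simple filter that pulls identifier-like tokens out of a query and returns the chunks that contain them verbatim. The function names extract_identifiers and exact_token_matches are hypothetical, the rule for what counts as "identifier-like" (any token containing an underscore, dot, or digit) is an assumption chosen for illustration, and the sample chunks are made up.

```python
def extract_identifiers(query: str) -> list[str]:
    """Return tokens that look like identifiers.

    ASSUMPTION: a token is identifier-like if it contains an underscore,
    a dot, or a digit once surrounding punctuation has been stripped.
    """
    identifiers = []
    for raw in query.split():
        token = raw.strip("?!.,;:'\"()")
        if any(ch.isdigit() or ch in "_." for ch in token):
            identifiers.append(token)
    return identifiers


def exact_token_matches(query: str, chunks: list[str]) -> list[str]:
    """Return chunks that contain any identifier from the query verbatim."""
    identifiers = extract_identifiers(query)
    if not identifiers:
        return []
    return [c for c in chunks if any(tok in c for tok in identifiers)]


# Made-up sample chunks, for illustration only.
chunks = [
    "Transactions that wait on each other's locks can stall indefinitely.",
    "E_DEADLOCK_0x8F3 is raised when two writers lock the same rows in opposite order.",
    "Use connection pooling to reduce latency.",
]

matches = exact_token_matches("What causes error E_DEADLOCK_0x8F3?", chunks)
for chunk in matches:
    print(chunk)
```

A filter like this could run alongside index.search(), with its hits merged into the semantic results. It only reports whether a token is present; ranking such hits by how informative they are is what BM25 adds.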
Knowledge Check
A user searches for "numpy.einsum signature". Semantic search returns chunks about "matrix multiplication" and "tensor operations" instead. Why?
Recap — what you just learned
- ✓ VectorIndex stores parallel lists of chunks and their embeddings for O(n) similarity search
- ✓ At query time, embed the query and rank all chunks by cosine similarity
- ✓ Semantic search excels at concept matching but misses exact tokens like error codes and function names
- ✓ For exact-token queries, add BM25 keyword search (next lesson)