Session 11 · 01 · 20 min
Chunking a Document
What you'll learn
- Explain why chunking is needed before embedding
- Implement three chunking strategies: character, sentence, and section
- Choose the right strategy for different document types
Session overview
Session 11 covers RAG (Retrieval-Augmented Generation). Instead of stuffing an entire document into a prompt, RAG retrieves only the relevant pieces. Chunking is the first step: splitting documents into pieces small enough to embed efficiently.
Why chunk?
Language model context windows have token limits. Even when a document fits, sending irrelevant paragraphs wastes tokens, increases cost, and dilutes the signal for the model. Chunking splits a document into small, meaningful pieces so only the most relevant pieces are retrieved and sent.
- Size limits: embedding models cap input length (commonly somewhere between 512 and about 8k tokens, depending on the model), and generation models charge per token
- Quality: smaller, focused chunks tend to produce more accurate similarity scores
- Cost: retrieving 3 chunks costs far less than sending 300 pages
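To make the size and cost points concrete, here is a rough sketch. It assumes about 4 characters per token, which is only a rule of thumb for English text; real counts depend on the tokenizer. The names `estimate_tokens`, `fits_limit`, and `CHARS_PER_TOKEN` are illustrative, and the 8,000-token default simply mirrors the 8k figure mentioned above.

```python
# Rough heuristic (assumption): ~4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return an approximate token count; a real tokenizer will differ."""
    return max(1, len(text) // CHARS_PER_TOKEN) if text else 0


def fits_limit(text: str, token_limit: int = 8000) -> bool:
    """Check the estimate against a hypothetical model limit."""
    return estimate_tokens(text) <= token_limit


document = "word " * 60_000   # 300,000 characters
chunk = document[:2_000]      # one 2,000-character piece

print(estimate_tokens(document), fits_limit(document))
print(estimate_tokens(chunk), fits_limit(chunk))
```

With these numbers the full document estimates to 75,000 tokens and fails the check, while the 2,000-character piece estimates to 500 tokens and passes.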
Strategy 1 — Character chunking
chunking.py
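def chunk_by_char(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        # overlap >= size would stop the window from advancing (infinite loop)
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window already reaches the end of the text
        start += size - overlap  # step forward, repeating `overlap` characters
    return chunks
Strategy 2 — Sentence chunking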
chunking.py
import re

def chunk_by_sentence(text: str, max_sentences: int = 5) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = ' '.join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks
Strategy 3 — Section chunking
chunking.py
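def chunk_by_section(text: str, separator: str = '\n## ') -> list[str]:
    # str.split removes the separator, so chunks after the first lose their '## ' prefix
    raw = text.split(separator)
    chunks = []
    for part in raw:
        part = part.strip()
        if part:  # skip empty parts, e.g. from consecutive separators
            chunks.append(part)
    return chunks
Character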
fast, format-agnostic
- Works on any text
- No parsing required
- May split mid-sentence
- Best for dense, unstructured text
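A short run makes the trade-offs listed above visible. The sketch below restates the fixed-size chunker compactly so it is self-contained, and uses a deliberately tiny window (`size=20`, `overlap=5`) on an illustrative sample string so the overlap and the mid-word cuts are easy to see.

```python
def chunk_by_char(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size windows; each window repeats the last `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


# Illustrative sample text (45 characters), not from the lesson.
sample = "Chunking splits text into overlapping windows"
chunks = chunk_by_char(sample, size=20, overlap=5)

for chunk in chunks:
    print(repr(chunk))

# Each chunk starts with the last 5 characters of the previous one.
for previous, current in zip(chunks, chunks[1:]):
    assert previous[-5:] == current[:5]
```

This prints `'Chunking splits text'`, `' text into overlappi'`, and `'lapping windows'`. The word "overlapping" is cut in the middle, which is the cost of ignoring structure; the repeated characters at each seam are what the overlap buys back.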
Sentence
grammar-aware
- Respects sentence boundaries
- Needs reliable sentence splitting
- Variable chunk length
- Best for prose documents
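The point about reliable sentence splitting is easiest to see on text with an abbreviation. The sketch below reuses the lesson's regular expression; the helper `split_sentences`, the constant `SENTENCE_BOUNDARY`, and the sample text are illustrative additions, and the chunker is restated in compact form so the block runs on its own.

```python
import re

# Same pattern as the lesson's chunk_by_sentence: split on whitespace
# that follows '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text: str) -> list[str]:
    """Return the non-empty pieces produced by the boundary pattern."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def chunk_by_sentence(text: str, max_sentences: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most `max_sentences`."""
    sentences = split_sentences(text)
    return [
        ' '.join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


text = "Dr. Lee wrote the parser. It handles PDFs! Does it scale? Yes."

print(split_sentences(text))
print(chunk_by_sentence(text, max_sentences=2))
```

The pattern treats "Dr." as a sentence of its own, so the splitter reports five sentences instead of four and the first chunk holds only one real sentence. The three resulting chunks also differ a lot in length, which is the variable chunk length noted above. For abbreviation-heavy text, a dedicated sentence tokenizer is likely a better fit than a single regular expression.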
Section
structure-aware
- Uses document headings
- Semantically coherent chunks
- Fails if no headings exist
- Best for Markdown, wikis, docs
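The sketch below shows what "fails if no headings exist" looks like in practice. The section chunker is restated compactly; `needs_fallback` is a hypothetical helper added for illustration, and both sample documents are made up.

```python
def chunk_by_section(text: str, separator: str = '\n## ') -> list[str]:
    """Split on a heading marker and drop empty parts (restated from the lesson)."""
    return [part.strip() for part in text.split(separator) if part.strip()]


def needs_fallback(text: str, separator: str = '\n## ') -> bool:
    """Hypothetical helper: True when the separator never occurs in the text."""
    return separator not in text


markdown_doc = "Intro text.\n## Setup\nInstall it.\n## Usage\nRun it."
plain_doc = "No headings here. Just one long run of text."

print(chunk_by_section(markdown_doc))  # three chunks, '## ' marker consumed
print(chunk_by_section(plain_doc))     # one chunk: the whole document
print(needs_fallback(markdown_doc), needs_fallback(plain_doc))
```

Without headings the function does not raise an error; it quietly returns the whole document as a single chunk, so a check like `needs_fallback` is one way to decide when to switch to character or sentence chunking. Note also that a document beginning directly with `## ` (no preceding newline) keeps that marker in its first chunk, because the default separator includes the newline.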
No API keys needed
Chunking is pure Python — no external API calls. Run these functions locally on any text file. You will wire them into the full pipeline in lesson 06.
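As a minimal sketch of running a chunker on a local file: the helper `chunk_file` and the throwaway `sample.txt` are assumptions made for this example. In the course layout `chunk_by_char` would live in chunking.py; it is restated here only so the block is self-contained.

```python
import tempfile
from pathlib import Path


def chunk_by_char(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size windows with overlap (restated from the lesson)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def chunk_file(path: Path, size: int = 500, overlap: int = 50) -> list[str]:
    """Hypothetical helper: read a UTF-8 text file and character-chunk it."""
    return chunk_by_char(path.read_text(encoding="utf-8"), size, overlap)


# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("lorem ipsum " * 100, encoding="utf-8")  # 1,200 characters
    chunks = chunk_file(sample_path)

print(len(chunks), [len(c) for c in chunks])
```

For the 1,200-character sample and the default settings this yields three chunks of 500, 500, and 300 characters. Swapping in `chunk_by_sentence` or `chunk_by_section` only changes the function called inside `chunk_file`.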
Knowledge Check
A 200-page technical PDF has been converted to plain text with no headings. Which chunking strategy is most appropriate?
Recap — what you just learned
- Chunking splits documents so only relevant pieces are retrieved, saving tokens and improving quality
- Character chunking is format-agnostic and fast; use overlap to avoid losing context at seams
- Sentence chunking respects grammar; best for prose
- Section chunking uses document structure; best for Markdown and wikis