Session 11 · 01 · 20 min

Chunking a Document

What you'll learn
  • Explain why chunking is needed before embedding
  • Implement three chunking strategies: character, sentence, and section
  • Choose the right strategy for different document types
Session overview
Session 11 covers RAG — Retrieval Augmented Generation. Instead of stuffing an entire document into a prompt, RAG retrieves only the relevant pieces. Chunking is the first step: splitting documents into pieces small enough to embed efficiently.

Why chunk?

Language model context windows have token limits. Even when a document fits, sending irrelevant paragraphs wastes tokens, increases cost, and dilutes the signal for the model. Chunking splits a document into small, meaningful pieces so only the most relevant pieces are retrieved and sent.

  • Size limits — embedding models cap input length (commonly between 512 and roughly 8k tokens, depending on the model); generation models charge per token
  • Quality — smaller, focused chunks produce more accurate similarity scores
  • Cost — retrieving 3 chunks costs far less than sending 300 pages
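
The cost gap can be made concrete with a back-of-the-envelope estimate. This is only a sketch: the 4-characters-per-token ratio is a rough rule of thumb for English text, and the page density and chunk size are hypothetical values chosen for illustration, not measurements.

```python
CHARS_PER_TOKEN = 4     # assumption: rough rule of thumb for English text
CHARS_PER_PAGE = 3000   # assumption: hypothetical page density
CHUNK_SIZE = 500        # assumption: characters per retrieved chunk


def estimate_tokens(num_chars: int) -> int:
    """Very rough token estimate derived from a character count."""
    return num_chars // CHARS_PER_TOKEN


full_document = estimate_tokens(300 * CHARS_PER_PAGE)  # all 300 pages
three_chunks = estimate_tokens(3 * CHUNK_SIZE)         # only 3 retrieved chunks

print(full_document, three_chunks)  # 225000 375
```

Under these assumptions the full document needs about 600 times as many tokens as the three retrieved chunks. A real tokenizer will give different absolute numbers, but the order of magnitude is the point.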

Strategy 1 — Character chunking

chunking.py
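def chunk_by_char(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    # Guard: with overlap >= size the window would never advance (infinite loop).
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last window reached the end; avoid a redundant tail chunk
        start += size - overlap
    return chunks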
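
To see how `size` and `overlap` interact, the sketch below lists the offsets a sliding window covers on a 26-character string. `char_windows` is a hypothetical helper written for this illustration, not part of chunking.py, and it assumes the window stops as soon as it reaches the end of the text.

```python
def char_windows(length: int, size: int = 500, overlap: int = 50) -> list[tuple[int, int]]:
    """Return the (start, end) offsets a sliding character window would cover."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    windows = []
    start = 0
    while start < length:
        end = min(start + size, length)
        windows.append((start, end))
        if end == length:
            break  # the text is fully covered
        start += step
    return windows


text = "abcdefghijklmnopqrstuvwxyz"
for start, end in char_windows(len(text), size=10, overlap=3):
    print(start, end, text[start:end])
# 0 10 abcdefghij
# 7 17 hijklmnopq
# 14 24 opqrstuvwx
# 21 26 vwxyz
```

Each window starts 7 characters after the previous one (size minus overlap), so neighbouring chunks share 3 characters. That shared region is what preserves context across a seam.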

Strategy 2 — Sentence chunking

chunking.py
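import re

def chunk_by_sentence(text: str, max_sentences: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most `max_sentences` each."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    # Split on whitespace that follows '.', '!' or '?'. This is a simple
    # heuristic, not a full sentence tokenizer.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = ' '.join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks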
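
The regular expression used above treats any whitespace after `.`, `!`, or `?` as a sentence boundary. The short sketch below applies the same pattern to a made-up sample sentence to show where that heuristic works and where it misfires; the sample text is an assumption chosen to trigger the abbreviation case.

```python
import re

# Same boundary pattern as in chunk_by_sentence.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

sample = "Dr. Smith reviewed the report. It was late! Was it accurate?"
sentences = SENTENCE_BOUNDARY.split(sample.strip())

print(sentences)
# ['Dr.', 'Smith reviewed the report.', 'It was late!', 'Was it accurate?']
```

The abbreviation "Dr." is split off as its own "sentence". For text with many abbreviations, a dedicated sentence tokenizer may be a better fit than a single regular expression.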

Strategy 3 — Section chunking

chunking.py
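def chunk_by_section(text: str, separator: str = '\n## ') -> list[str]:
    """Split text at each occurrence of `separator` (by default a Markdown level-2 heading)."""
    # str.split removes the separator, so every chunk after the first starts with
    # the heading text without its leading '## ' marker.
    raw = text.split(separator)
    chunks = []
    for part in raw:
        part = part.strip()
        if part:
            chunks.append(part)
    return chunks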
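
Because `str.split` discards the separator, the function above drops the `## ` marker from every heading after the first split. If the marker should stay attached to its section, one possible variant splits just before each heading line instead. `chunk_by_heading` is a hypothetical name used only for this sketch, and it assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string.

```python
import re


def chunk_by_heading(text: str, marker: str = "## ") -> list[str]:
    """Split before each line that starts with `marker`, keeping the heading in its chunk."""
    # Zero-width lookahead: match the position at the start of a heading line
    # without consuming the marker itself.
    pattern = re.compile(rf"^(?={re.escape(marker)})", flags=re.MULTILINE)
    return [part.strip() for part in pattern.split(text) if part.strip()]


doc = "Intro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
for chunk in chunk_by_heading(doc):
    print(repr(chunk))
# 'Intro text.'
# '## Setup\nInstall it.'
# '## Usage\nRun it.'
```

Keeping the heading inside the chunk gives the embedding a little extra context about what the section covers.
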
Character: fast, format-agnostic
  • Works on any text
  • No parsing required
  • May split mid-sentence
  • Best for dense, unstructured text

Sentence: grammar-aware
  • Respects sentence boundaries
  • Needs reliable sentence splitting
  • Variable chunk length
  • Best for prose documents

Section: structure-aware
  • Uses document headings
  • Semantically coherent chunks
  • Returns the whole document as one chunk if no headings exist
  • Best for Markdown, wikis, docs
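
The trade-offs above can be turned into a rough decision rule. The sketch below is illustrative only: `pick_strategy` is a hypothetical helper, and its thresholds (two headings, three sentence endings) are arbitrary assumptions rather than recommended values.

```python
import re


def pick_strategy(text: str) -> str:
    """Return 'section', 'sentence', or 'character' using simple, illustrative heuristics."""
    # Two or more Markdown level-2 headings suggest usable document structure.
    if len(re.findall(r"^## ", text, flags=re.MULTILINE)) >= 2:
        return "section"
    # Sentence-ending punctuation followed by whitespace suggests prose.
    if len(re.findall(r"[.!?]\s", text)) >= 3:
        return "sentence"
    # Otherwise fall back to the format-agnostic option.
    return "character"


print(pick_strategy("## Setup\nInstall it.\n## Usage\nRun it."))  # section
print(pick_strategy("One. Two. Three. Four."))                    # sentence
print(pick_strategy("id,name,score\n1,ada,90\n2,bob,85"))         # character
```

In practice the choice usually comes from knowing the document type in advance, so a heuristic like this is at most a fallback for mixed collections.
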
No API keys needed
Chunking is pure Python — no external API calls. Run these functions locally on any text file. You will wire them into the full pipeline in lesson 06.
Knowledge Check
A 200-page technical PDF has been converted to plain text with no headings. Which chunking strategy is most appropriate?
Recap — what you just learned
  • Chunking splits documents so only relevant pieces are retrieved, saving tokens and improving quality
  • Character chunking is format-agnostic and fast; use overlap to avoid losing context at seams
  • Sentence chunking respects grammar; best for prose
  • Section chunking uses document structure; best for Markdown and wikis