Session 11 · 01 · 20 min

Chunking a Document

What you'll learn

▸Explain why chunking is needed before embedding
▸Implement three chunking strategies: character, sentence, and section
▸Choose the right strategy for different document types

Session overview

Session 11 covers Retrieval-Augmented Generation (RAG). Instead of stuffing an entire document into a prompt, RAG retrieves only the relevant pieces. Chunking is the first step: splitting documents into pieces small enough to embed efficiently.

Why chunk?

Language model context windows have token limits. Even when a document fits, sending irrelevant paragraphs wastes tokens, increases cost, and dilutes the signal for the model. Chunking splits a document into small, meaningful pieces so only the most relevant pieces are retrieved and sent.

Size limits — embedding models cap input length, often at 8k tokens or fewer depending on the model; generation models charge per token
Quality — smaller, focused chunks produce more accurate similarity scores
Cost — retrieving 3 chunks costs far less than sending 300 pages
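
To make the cost point concrete, here is a back-of-the-envelope sketch. It assumes roughly 4 characters per token (a common rule of thumb for English text, not a real tokenizer) and a hypothetical 3,000 characters per page. `estimate_tokens` and both constants are illustrative names, not part of the lesson's code.

```python
# Assumption: ~4 characters per token. Real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4
# Assumption: a hypothetical page holds about 3,000 characters.
CHARS_PER_PAGE = 3000


def estimate_tokens(num_chars: int) -> int:
    """Very rough token estimate from a character count."""
    return num_chars // CHARS_PER_TOKEN


# Sending the whole 300-page document.
full_doc_tokens = estimate_tokens(300 * CHARS_PER_PAGE)
# Sending only three retrieved chunks of 500 characters each.
retrieved_tokens = estimate_tokens(3 * 500)

print(full_doc_tokens)   # 225000
print(retrieved_tokens)  # 375
```

Under these assumptions the retrieved chunks use several hundred times fewer tokens than the full document. The exact ratio depends on the real tokenizer and chunk size.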

Strategy 1 — Character chunking

chunking.py

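def chunk_by_char(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # The step between chunk starts must be positive, or the loop never ends.
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the tail is covered; skip a redundant overlap-only chunk
        start += size - overlap  # step back by `overlap` so neighbours share context
    return chunks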
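
One way to see what `overlap` does is to list where each chunk starts. The helper below, `chunk_starts`, is a hypothetical name introduced for illustration. It assumes `overlap` is smaller than `size` and only sketches the stepping of a fixed-size chunker with overlap.

```python
def chunk_starts(length: int, size: int, overlap: int) -> list[int]:
    """Start offsets of chunks that advance by size - overlap characters."""
    step = size - overlap  # assumed positive, i.e. overlap < size
    return list(range(0, length, step))


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
starts = chunk_starts(len(text), size=10, overlap=3)
print(starts)  # [0, 7, 14, 21]

for s in starts:
    # Each slice is at most 10 characters long; the last one is shorter.
    print(s, repr(text[s:s + 10]))
```

With a size of 10 and an overlap of 3, each chunk begins 7 characters after the previous one, so it repeats the last 3 characters of its predecessor. That repeated text is what keeps a phrase cut at a seam visible in at least one chunk.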

Strategy 2 — Sentence chunking

chunking.py

import re

def chunk_by_sentence(text: str, max_sentences: int = 5) -> list[str]:
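    # Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps the
    # punctuation attached to its sentence. Abbreviations such as "Dr." also match.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())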
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = ' '.join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks
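
The card for this strategy notes that it needs reliable sentence splitting. The short check below reuses the same regular expression on a made-up sample string to show where that can go wrong. The constant name `SENTENCE_BOUNDARY` is introduced here for illustration only.

```python
import re

# Same pattern as in chunk_by_sentence: whitespace preceded by '.', '!' or '?'.
SENTENCE_BOUNDARY = r'(?<=[.!?])\s+'

sample = "Dr. Smith arrived. He sat down! Was it late? Yes."
pieces = re.split(SENTENCE_BOUNDARY, sample)
print(pieces)
# ['Dr.', 'Smith arrived.', 'He sat down!', 'Was it late?', 'Yes.']
```

The abbreviation "Dr." is treated as its own sentence. For text with many abbreviations, decimals, or initials, a dedicated sentence tokenizer may be a better fit than this regex.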

Strategy 3 — Section chunking

chunking.py

def chunk_by_section(text: str, separator: str = '\n## ') -> list[str]:
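    # split() consumes the separator, so every chunk after the first loses its
    # '## ' marker. A heading on the very first line has no leading newline and keeps it.
    raw = text.split(separator)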
    chunks = []
    for part in raw:
        part = part.strip()
        if part:
            chunks.append(part)
    return chunks
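
If the heading text should stay attached to each chunk, one possible variant splits just before each heading instead of on it. This is a sketch, not the lesson's implementation: `chunk_by_heading` is a hypothetical name, it assumes Markdown-style level-2 headings at the start of a line, and it relies on `re.split` accepting zero-width matches (Python 3.7 or later).

```python
import re


def chunk_by_heading(text: str) -> list[str]:
    """Split before each line that starts with '## ', keeping the heading."""
    # (?m) lets ^ match at the start of every line. The lookahead matches
    # without consuming text, so the '## ' marker stays with its section.
    parts = re.split(r'(?m)^(?=## )', text)
    return [p.strip() for p in parts if p.strip()]


doc = "Intro paragraph.\n## Setup\nInstall it.\n## Usage\nRun it."
for chunk in chunk_by_heading(doc):
    print(repr(chunk))
# 'Intro paragraph.'
# '## Setup\nInstall it.'
# '## Usage\nRun it.'
```

Keeping the heading inside the chunk can help retrieval, since the heading often names the topic that the body text only implies.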

Character

fast, format-agnostic

•Works on any text
•No parsing required
•May split mid-sentence
•Best for dense, unstructured text

Sentence

grammar-aware

•Respects sentence boundaries
•Needs reliable sentence splitting
•Variable chunk length
•Best for prose documents

Section

structure-aware

•Uses document headings
•Semantically coherent chunks
•Fails if no headings exist
•Best for Markdown, wikis, docs
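
The trade-offs in the three cards above can be sketched as a simple selector. The function `pick_strategy` and its thresholds are hypothetical and purely heuristic. Real documents may need manual inspection before choosing.

```python
import re


def pick_strategy(text: str) -> str:
    """Return 'section', 'sentence', or 'character' using simple heuristics."""
    # Markdown-style level-2 headings suggest usable document structure.
    if re.search(r'(?m)^## ', text):
        return 'section'
    # Several sentence endings followed by whitespace suggest prose.
    # The threshold of 2 is an arbitrary assumption.
    if len(re.findall(r'[.!?]\s', text)) >= 2:
        return 'sentence'
    # Otherwise fall back to the format-agnostic option.
    return 'character'


print(pick_strategy("## Install\nRun the setup script."))  # section
print(pick_strategy("One. Two. Three."))                   # sentence
print(pick_strategy("id,name\n1,alpha\n2,beta"))           # character
```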

No API keys needed

Chunking is pure Python — no external API calls. Run these functions locally on any text file. You will wire them into the full pipeline in lesson 06.

Knowledge Check

A 200-page technical PDF has been converted to plain text with no headings. Which chunking strategy is most appropriate?

Recap — what you just learned

✓Chunking splits documents so only relevant pieces are retrieved, saving tokens and improving quality
✓Character chunking is format-agnostic and fast; use overlap to avoid losing context at seams
✓Sentence chunking respects grammar; best for prose
✓Section chunking uses document structure; best for Markdown and wikis

Next up: Text Embeddings