Fasttrack · 06 · 30 min

RAG: Ask Questions About Your PDFs

What you'll learn
  • Understand the full Retrieval-Augmented Generation pipeline
  • Load and split documents into chunks
  • Embed, store, retrieve, and generate answers from your own data

Your LLM does not know about your company docs, research papers, internal wikis, or any document created after its training cutoff. RAG (Retrieval-Augmented Generation) bridges this gap. Instead of retraining the model, you retrieve the relevant passages from your documents at query time and inject them into the prompt. The model answers based on what you put in front of it.

The complete RAG pipeline: Load → Split → Embed → Store → Retrieve → Generate
The 6-step RAG pipeline
  • Load: PDF → Documents
  • Split: chunks of ~1000 chars
  • Embed: text → vectors
  • Store: vectors in Chroma
  • Retrieve: top-k similar chunks
  • Generate: answer with context

Step 1-2: Load and split the PDF

PyPDFLoader reads the PDF and returns a list of Document objects, one per page. RecursiveCharacterTextSplitter then divides each page into chunks of at most 1000 characters, with up to 150 characters shared between neighboring chunks. The overlap means that a concept spanning a chunk boundary is not lost: both chunks contain the bridging text.

rag_pipeline.py
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Load
loader = PyPDFLoader('docs/research_paper.pdf')
documents = loader.load()   # list of Document objects, one per page

# Step 2: Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk (roughly 150-200 English words)
    chunk_overlap=150,  # overlap prevents losing ideas at boundaries
)
chunks = splitter.split_documents(documents)
print(f'Created {len(chunks)} chunks from {len(documents)} pages')
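To see what the overlap does, here is a minimal sketch of a fixed-size sliding window in plain Python. It is a simplification, not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to break on paragraph, line, and word boundaries). The function name split_with_overlap and the toy text are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighboring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: the alphabet, split into 10-character windows with 3 shared characters.
toy_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

The last three characters of each window reappear at the start of the next one, which is the bridging text described above.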

Step 3-4: Embed and store in a vector database

Each chunk becomes a vector — similar meanings are close together in vector space

An embedding model converts each text chunk into a list of numbers (a vector) that encodes its meaning. Chunks about similar topics end up numerically close together in this vector space. Chroma stores these vectors on disk so you only compute them once, not on every query.

rag_pipeline.py
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Step 3: Create embeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

# Step 4: Store in a vector database (persisted to disk)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory='./chroma_db',
)
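# Note: _collection is a private attribute of the Chroma wrapper. It is handy
# for a quick sanity check but may change between library versions.
print(f'Stored {vectorstore._collection.count()} vectors')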
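To make "numerically close" concrete, the sketch below computes cosine similarity between a few hand-written vectors. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from the embedding model and have hundreds or thousands of dimensions. Cosine similarity is one common closeness measure, and a given vector store may be configured to use a different one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
toy_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "return an item": [0.8, 0.2, 0.1],
    "GPU benchmarks": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["refund policy"], toy_vectors["return an item"])
sim_unrelated = cosine_similarity(toy_vectors["refund policy"], toy_vectors["GPU benchmarks"])
print(f"related topics:   {sim_related:.3f}")
print(f"unrelated topics: {sim_unrelated:.3f}")
```

The two vectors that stand for related topics score close to 1, while the unrelated pair scores close to 0.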

Step 5-6: Retrieve and generate

Only relevant chunks are injected into the prompt — the LLM answers from your data
rag_pipeline.py
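from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

# Assumption: `model` is not defined anywhere in this lesson, so we create a chat
# model here. The model name is only an example; any LangChain chat model works.
model = ChatOpenAI(model='gpt-4o-mini')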

def ask(question: str) -> str:
    # Step 5: Retrieve the 3 most relevant chunks
    relevant_chunks = vectorstore.similarity_search(question, k=3)  # ①
    context = '\n\n---\n\n'.join(chunk.page_content for chunk in relevant_chunks)  # ②

    # Step 6: Generate an answer grounded in the retrieved context
    prompt = (
        f'Answer the question based ONLY on the context below.\n\n'
        f'Context:\n{context}\n\n'
        f'Question: {question}'
    )
    response = model.invoke([HumanMessage(content=prompt)])          # ③
    return response.content

print(ask('What are the key findings of the paper?'))
① similarity_search finds the top-k chunks whose embeddings are closest to the question embedding
② Join the chunks with a separator so the LLM can distinguish where each chunk ends
③ The model only sees the retrieved context, not the entire document, which keeps the prompt short and the answer focused on the relevant passages
The model only sees what you put in the prompt — retrieval quality drives answer quality. If the wrong chunks are retrieved, the model will give wrong answers even with a perfect LLM. Invest time tuning chunk size, overlap, and embedding model choice before optimizing the LLM.
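The retrieval step can be pictured as a brute-force nearest-neighbor search. The sketch below is a toy stand-in for similarity_search, not Chroma's implementation: the store contents, the vectors, and the function name top_k are all hypothetical, and it ranks by Euclidean distance, whereas a real vector store uses an index and a configurable distance metric.

```python
import math


def top_k(query_vector: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(store, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical in-memory "vector store": (chunk text, toy embedding) pairs.
toy_store = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Items can be returned by mail.", [0.8, 0.2, 0.1]),
    ("The GPU cluster has 64 nodes.", [0.0, 0.1, 0.9]),
]

# Pretend this is the embedding of the question "How do refunds work?"
query_vector = [1.0, 0.1, 0.0]

retrieved = top_k(query_vector, toy_store, k=2)
toy_context = "\n\n---\n\n".join(retrieved)  # same separator as in ask()
print(toy_context)
```

Only the two refund-related chunks reach the prompt. If the toy embeddings placed the GPU sentence closer to the query, the model would be handed the wrong context, which is why retrieval quality matters so much.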
Knowledge Check
What does the retrieval step in RAG actually do?
Recap — what you just learned
  • RAG = Load → Split → Embed → Store → Retrieve → Generate — six clear steps
  • Chunks overlap so ideas at boundaries are not lost between adjacent chunks
  • Embeddings convert text to vectors — similar meanings are numerically close
  • Only retrieved chunks go into the prompt — retrieval quality determines answer quality
Next up: 07 — Your First Agent