Session 11 · 06 · 30 min

Complete RAG Pipeline

What you'll learn
  • Wire all six RAG stages into an end-to-end pipeline
  • Prompt Claude with retrieved context using XML tags
  • Handle graceful refusal when retrieved context is insufficient

The complete pipeline

RAG Pipeline
1. Chunk: split document
2. Embed: generate vectors
3. Index: VectorIndex + BM25
4. Retrieve: hybrid RRF search
5. Prompt: inject context
6. Generate: Claude answers
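
Stage 4 fuses the vector and BM25 result lists with reciprocal rank fusion (RRF). The retriever's internals are not shown in this lesson, so the following is only a minimal sketch of the general technique: the function name, the document ids, and the constant k=60 (a commonly used default) are assumptions, not this course's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents ranked highly by several retrievers end up on top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from the two indexes for the same query.
vector_ranking = ['attention', 'scaling', 'fine-tuning']
bm25_ranking = ['fine-tuning', 'attention', 'scaling']

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)  # ['attention', 'fine-tuning', 'scaling']
```

RRF uses only rank positions, so the vector similarity scores and BM25 scores never have to be put on a common scale.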

Building the prompt with XML tags

rag_pipeline.py
def build_prompt(query: str, chunks: list[str]) -> str:
    context_blocks = ''
    for i, chunk in enumerate(chunks, start=1):
        context_blocks += f'<chunk id="{i}">\n{chunk}\n</chunk>\n'
    return (
        '<context>\n'
        + context_blocks
        + '</context>\n\n'
        + 'Answer the question using ONLY the information in <context>.\n'
        + 'If the context does not contain enough information, say '
        + '"I don\'t have enough information to answer that."\n\n'
        + f'Question: {query}'
    )
  • XML tags clearly separate the retrieved context from the instructions, which makes it much less likely that the model confuses the two
  • Numbering chunks with id= makes it easier to trace which chunk drove the answer
  • The "ONLY the information" instruction discourages answers drawn from pre-training knowledge; it reduces hallucination but cannot eliminate it
  • The explicit refusal instruction tells the model to say it does not know rather than fabricate
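
To see exactly what the model receives, it can help to render the prompt for a small input. The sketch below repeats build_prompt so that it runs on its own; the query and the two chunk strings are made-up examples.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    # Same function as in rag_pipeline.py, repeated so this snippet is self-contained.
    context_blocks = ''
    for i, chunk in enumerate(chunks, start=1):
        context_blocks += f'<chunk id="{i}">\n{chunk}\n</chunk>\n'
    return (
        '<context>\n'
        + context_blocks
        + '</context>\n\n'
        + 'Answer the question using ONLY the information in <context>.\n'
        + 'If the context does not contain enough information, say '
        + '"I don\'t have enough information to answer that."\n\n'
        + f'Question: {query}'
    )


# Made-up inputs, for illustration only.
prompt = build_prompt(
    'What does LoRA reduce?',
    ['LoRA reduces fine-tuning cost.', 'Transformers use self-attention.'],
)
print(prompt)
# <context>
# <chunk id="1">
# LoRA reduces fine-tuning cost.
# </chunk>
# <chunk id="2">
# Transformers use self-attention.
# </chunk>
# </context>
#
# Answer the question using ONLY the information in <context>.
# If the context does not contain enough information, say "I don't have enough information to answer that."
#
# Question: What does LoRA reduce?
```

The context comes first and the question last, so the instructions sit right next to the question they govern.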

ask_with_rag function

rag_pipeline.py
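import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
anthropic_client = anthropic.Anthropic()

# Retriever is assumed to be defined or imported above this point
# (the hybrid VectorIndex + BM25 retriever from stages 3 and 4).
def ask_with_rag(query: str, retriever: Retriever, top_k: int = 3) -> str:
    chunks = retriever.search(query, top_k=top_k)  # stage 4: retrieve
    prompt = build_prompt(query, chunks)           # stage 5: prompt
    message = anthropic_client.messages.create(    # stage 6: generate
        model='claude-opus-4-5',
        max_tokens=512,
        messages=[{'role': 'user', 'content': prompt}],
    )
    # The reply is a list of content blocks; the first one holds the answer text.
    return message.content[0].text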

Example queries

rag_pipeline.py
document = """
## Attention Mechanisms
Transformers use self-attention to relate all token positions directly.
This avoids the sequential bottleneck of RNNs and enables parallelism.

## Scaling Laws
Model capability scales predictably with parameters and training compute.
The Chinchilla paper showed training requires more data than previously assumed.

## Fine-Tuning
Fine-tuning adapts a pre-trained model to a specific task with a small dataset.
LoRA reduces fine-tuning cost by training low-rank adapter matrices only.
"""

chunks = chunk_by_section(document)
retriever = Retriever(chunks)

# Query 1: well-covered topic
print(ask_with_rag('What advantage does self-attention have over RNNs?', retriever))

# Query 2: exact-term match (the rare token 'LoRA' favors BM25)
print(ask_with_rag('What does LoRA stand for?', retriever))

# Query 3: out-of-scope — expect refusal
print(ask_with_rag('What is the capital of France?', retriever))
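
The example above calls chunk_by_section and Retriever, neither of which is defined in this lesson's listing. For readers who want to run the retrieval half without the full hybrid index, here is a rough stand-in. Both definitions are hypothetical: the chunker simply splits on '## ' headings, and the toy retriever ranks chunks by keyword overlap rather than by the vector + BM25 fusion the pipeline describes.

```python
import re


def chunk_by_section(document: str) -> list[str]:
    """Split a Markdown-style document into one chunk per '## ' heading (stand-in)."""
    sections = re.split(r'^(?=## )', document, flags=re.MULTILINE)
    return [section.strip() for section in sections if section.strip()]


class Retriever:
    """Toy keyword-overlap retriever; NOT the hybrid retriever from the pipeline."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    @staticmethod
    def _tokens(text: str) -> set[str]:
        # Lowercased alphanumeric words, so 'LoRA' matches 'lora'.
        return set(re.findall(r'[a-z0-9]+', text.lower()))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        query_tokens = self._tokens(query)
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(query_tokens & self._tokens(chunk)),
            reverse=True,
        )
        return ranked[:top_k]


sample = """
## Attention Mechanisms
Transformers use self-attention to relate all token positions directly.

## Fine-Tuning
LoRA reduces fine-tuning cost by training low-rank adapter matrices only.
"""

sample_chunks = chunk_by_section(sample)
top = Retriever(sample_chunks).search('What does LoRA stand for?', top_k=1)
print(top[0].splitlines()[0])  # ## Fine-Tuning
```

Like most top-k retrievers, this stand-in returns chunks even when nothing matches, as with the capital-of-France query. Retrieval alone therefore cannot produce the refusal; the prompt has to.
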
Graceful refusal is a feature, not a bug
A RAG system that says "I don't have enough information" is far more trustworthy than one that hallucinates a plausible-sounding answer. The explicit refusal instruction in build_prompt gives the model an approved way to decline, which makes this behaviour much more consistent than leaving it to chance.
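
Because build_prompt asks for one fixed refusal sentence, calling code can check for it and react, for example by showing a fallback message or logging the query as a retrieval gap. The helper below is a hypothetical sketch, not part of rag_pipeline.py, and the substring check is only a heuristic: the model may occasionally word a refusal differently.

```python
# The sentence build_prompt instructs the model to use when context is insufficient.
REFUSAL_SENTENCE = "I don't have enough information to answer that."


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains the refusal sentence (heuristic)."""
    # Normalize typographic apostrophes and case before comparing,
    # and ignore the trailing period so minor punctuation changes still match.
    normalized = answer.replace('\u2019', "'").casefold()
    return REFUSAL_SENTENCE.casefold().rstrip('.') in normalized


print(is_refusal("I don't have enough information to answer that."))   # True
print(is_refusal('Self-attention avoids the sequential bottleneck.'))  # False
```

A caller could wrap ask_with_rag with this check and route refused queries to a human or to a broader search.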
Knowledge Check
The RAG system returns a hallucinated answer despite the "ONLY the information in <context>" instruction. What is the most likely cause?
Recap — what you just learned
  • The 6 RAG stages are: chunk, embed, index, retrieve, prompt, generate
  • XML tags in build_prompt clearly separate context from instructions
  • The "ONLY the information in <context>" rule discourages answers from pre-training knowledge, though it cannot rule out hallucination entirely; the refusal instruction handles gaps
  • ask_with_rag retrieves top-k chunks, builds the prompt, and calls Claude in under 10 lines
  • In production, replace the in-memory indexes with a vector database for embeddings and a search engine such as Elasticsearch for BM25