Session 11 · 06 · 30 min
Complete RAG Pipeline
What you'll learn
- Wire all six RAG stages into an end-to-end pipeline
- Prompt Claude with retrieved context using XML tags
- Handle graceful refusal when retrieved context is insufficient
The complete pipeline
RAG Pipeline
1. Chunk: split document
2. Embed: generate vectors
3. Index: VectorIndex + BM25
4. Retrieve: hybrid RRF search
5. Prompt: inject context
6. Generate: Claude answers
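Stage 4 merges the vector ranking and the BM25 ranking with reciprocal rank fusion (RRF). A minimal sketch of the idea follows, assuming each retriever returns an ordered list of chunk ids and using the commonly cited constant k = 60. The function name `rrf_fuse` and the sample ids are illustrative and not taken from the course's `Retriever` class.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every id it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_ranking = ["c2", "c1", "c3"]  # hypothetical vector-search order
bm25_ranking = ["c1", "c3", "c2"]    # hypothetical BM25 order
fused = rrf_fuse([vector_ranking, bm25_ranking])
print(fused)
```

An id that ranks well in both lists ends up ahead of one that tops a single list but sits low in the other, which is why the fusion needs only ranks and not comparable scores.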
Building the prompt with XML tags
rag_pipeline.py
def build_prompt(query: str, chunks: list[str]) -> str:
    # Wrap each retrieved chunk in a numbered <chunk> tag.
    context_blocks = ''
    for i, chunk in enumerate(chunks, start=1):
        context_blocks += f'<chunk id="{i}">\n{chunk}\n</chunk>\n'
    # Context comes first, then the instructions and the question.
    return (
        '<context>\n'
        + context_blocks
        + '</context>\n\n'
        + 'Answer the question using ONLY the information in <context>.\n'
        + 'If the context does not contain enough information, say '
        + '"I don\'t have enough information to answer that."\n\n'
        + f'Question: {query}'
    )

① XML tags separate the retrieved context from the instructions, so the model is less likely to confuse them
② Numbering chunks with id= makes it easier to trace which chunk drove the answer
③ The "ONLY the information" instruction discourages answers drawn from pre-training knowledge instead of the context
④ The explicit refusal instruction tells the model to say it does not know rather than fabricate
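To see the structure these notes describe, the sketch below renders a prompt for two made-up chunks. It repeats `build_prompt` in compact form so that it runs on its own; the sample query and chunk texts are placeholders.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    # Same layout as build_prompt above, written compactly.
    context_blocks = "".join(
        f'<chunk id="{i}">\n{chunk}\n</chunk>\n'
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "<context>\n"
        + context_blocks
        + "</context>\n\n"
        + "Answer the question using ONLY the information in <context>.\n"
        + "If the context does not contain enough information, say "
        + '"I don\'t have enough information to answer that."\n\n'
        + f"Question: {query}"
    )


prompt = build_prompt(
    "What does LoRA train?",
    [
        "LoRA trains low-rank adapter matrices only.",
        "Transformers use self-attention.",
    ],
)
print(prompt)
```

Printing the result shows the `<context>` block first, each chunk under its own `id`, and the question last. Chunk text that itself contains a literal `</chunk>` would break this structure; escaping is left out of the sketch.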
ask_with_rag function
rag_pipeline.py
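import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
anthropic_client = anthropic.Anthropic()


# Retriever is assumed to be defined earlier in this file.
def ask_with_rag(query: str, retriever: Retriever, top_k: int = 3) -> str:
    # Retrieve the top_k most relevant chunks for the query.
    chunks = retriever.search(query, top_k=top_k)
    # Wrap the chunks and the question in the XML-tagged prompt.
    prompt = build_prompt(query, chunks)
    message = anthropic_client.messages.create(
        model='claude-opus-4-5',
        max_tokens=512,
        messages=[{'role': 'user', 'content': prompt}],
    )
    # The first content block holds the text of the reply.
    return message.content[0].text
Example queries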
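The queries in this section rely on `chunk_by_section` and `Retriever`, which are not defined in this lesson. `ask_with_rag` itself only needs an object with a `search(query, top_k=...)` method that returns a list of chunk strings, so a stand-in along the following lines would be enough to try the flow. Both `chunk_by_heading` and `KeywordRetriever` are hypothetical names for this sketch, and the word-overlap scoring is a placeholder for hybrid RRF search, not the course's implementation.

```python
import re


def chunk_by_heading(document: str) -> list[str]:
    """Split a markdown-style document at lines starting with '## '."""
    parts = re.split(r"(?m)^(?=## )", document)
    return [part.strip() for part in parts if part.strip()]


class KeywordRetriever:
    """Toy retriever: ranks chunks by how many query words they share."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        query_words = self._words(query)
        scored = [
            (len(query_words & self._words(chunk)), chunk)
            for chunk in self.chunks
        ]
        # Keep only chunks with at least one shared word, best match first.
        # sorted() is stable, so ties keep document order.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [chunk for _, chunk in ranked[:top_k]]


document = """
## Attention Mechanisms
Transformers use self-attention to relate all token positions directly.
## Fine-Tuning
LoRA reduces fine-tuning cost by training low-rank adapter matrices only.
"""
retriever = KeywordRetriever(chunk_by_heading(document))
hits = retriever.search("How does LoRA reduce cost?", top_k=1)
print(hits)
```

Returning an empty list when nothing overlaps means an out-of-scope question reaches the model with an empty `<context>` block, which is the case the refusal instruction is written for.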
rag_pipeline.py
document = """
## Attention Mechanisms
Transformers use self-attention to relate all token positions directly.
This avoids the sequential bottleneck of RNNs and enables parallelism.
## Scaling Laws
Model capability scales predictably with parameters and training compute.
The Chinchilla paper showed training requires more data than previously assumed.
## Fine-Tuning
Fine-tuning adapts a pre-trained model to a specific task with a small dataset.
LoRA reduces fine-tuning cost by training low-rank adapter matrices only.
"""
chunks = chunk_by_section(document)
retriever = Retriever(chunks)
# Query 1: well-covered topic
print(ask_with_rag('What advantage does self-attention have over RNNs?', retriever))
# Query 2: exact token ('LoRA'), the kind of keyword match BM25 handles well
print(ask_with_rag('What does LoRA stand for?', retriever))
# Query 3: out-of-scope — expect refusal
print(ask_with_rag('What is the capital of France?', retriever))
Graceful refusal is a feature, not a bug
A RAG system that says "I don't have enough information" is far more trustworthy than one that hallucinates a plausible-sounding answer. The explicit refusal instruction in build_prompt makes this behaviour much more consistent, because the model is told exactly what to say when the retrieved context falls short.
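Because `build_prompt` fixes the exact refusal sentence, calling code can check for it and branch, for example to log the gap or hand the question to a person. A minimal sketch follows, assuming the model repeats the sentence closely; it may paraphrase, so treat the check as a heuristic. `REFUSAL_TEXT` and `is_refusal` are illustrative names.

```python
REFUSAL_TEXT = "I don't have enough information to answer that."


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains the refusal sentence from build_prompt."""
    # Normalise curly apostrophes and case so minor formatting
    # differences still match; the trailing period is optional.
    normalised = answer.replace("\u2019", "'").lower()
    return REFUSAL_TEXT.lower().rstrip(".") in normalised


print(is_refusal("I don't have enough information to answer that."))
print(is_refusal("Self-attention relates all token positions directly."))
```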
Knowledge Check
The RAG system returns a hallucinated answer despite the "ONLY the information in <context>" instruction. What is the most likely cause?
Recap — what you just learned
- The six RAG stages are: chunk, embed, index, retrieve, prompt, generate
- XML tags in build_prompt clearly separate context from instructions
- The "ONLY the information in <context>" rule discourages hallucination; the refusal instruction handles gaps
- ask_with_rag retrieves top-k chunks, builds the prompt, and calls Claude in under 10 lines
- In production, replace the in-memory indexes with a vector database and Elasticsearch for scale