Session 11 · 06 · 30 min

Complete RAG Pipeline

What you'll learn

▸Wire all six RAG stages into an end-to-end pipeline
▸Prompt Claude with retrieved context using XML tags
▸Handle graceful refusal when retrieved context is insufficient

The complete pipeline

RAG Pipeline

1. Chunk: split document
2. Embed: generate vectors
3. Index: VectorIndex + BM25
4. Retrieve: hybrid RRF search
5. Prompt: inject context
6. Generate: Claude answers
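Stage 4 combines the vector ranking and the BM25 ranking with reciprocal rank fusion (RRF). The following is a minimal sketch of that fusion step only, assuming each retriever returns a ranked list of chunk ids. The function name `rrf_fuse`, the sample ids, and the constant k = 60 (a commonly used default) are illustrative assumptions, not the course's own Retriever implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 3) -> list[str]:
    """Fuse several ranked lists of ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every id it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


# Hypothetical rankings from the two indexes for the same query.
vector_ranking = ["attention", "scaling", "fine-tuning"]
bm25_ranking = ["fine-tuning", "attention"]
print(rrf_fuse([vector_ranking, bm25_ranking]))
```

An id that appears near the top of both lists outranks one that appears in only one of them, which is why hybrid search tends to be more robust than either index alone.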

Building the prompt with XML tags

rag_pipeline.py

def build_prompt(query: str, chunks: list[str]) -> str:
    context_blocks = ''
    for i, chunk in enumerate(chunks, start=1):
        context_blocks += f'<chunk id="{i}">\n{chunk}\n</chunk>\n'
    return (
        '<context>\n'
        + context_blocks
        + '</context>\n\n'
        + 'Answer the question using ONLY the information in <context>.\n'
        + 'If the context does not contain enough information, say '
        + '"I don\'t have enough information to answer that."\n\n'
        + f'Question: {query}'
    )

①XML tags clearly separate context from instructions, which makes it much less likely that the model will confuse the two

②Numbering chunks with id= makes it easier to trace which chunk drove the answer

③The "ONLY the information" instruction discourages the model from answering out of pre-training knowledge; it reduces hallucination but does not eliminate it

④The explicit refusal instruction gives the model a sanctioned way to say it does not know rather than fabricate
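Note ② can be made concrete with a small tracing helper. This is a sketch under one loud assumption: the prompt has been extended to ask the model to cite chunk ids in square brackets such as [2], which build_prompt above does not do. The name `cited_chunks` and the sample answer are hypothetical.

```python
import re


def cited_chunks(answer: str, chunks: list[str]) -> dict[int, str]:
    """Map chunk ids cited as [n] in the answer back to the chunk text."""
    cited: dict[int, str] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        chunk_id = int(match.group(1))
        # Ids are 1-based, matching enumerate(chunks, start=1) in build_prompt.
        if 1 <= chunk_id <= len(chunks):
            cited[chunk_id] = chunks[chunk_id - 1]
    return cited


chunks = [
    "Self-attention relates all token positions directly.",
    "LoRA trains low-rank adapter matrices only.",
]
# Hypothetical model output that follows the assumed citation format.
answer = "LoRA lowers cost by training only adapter matrices [2]."
print(cited_chunks(answer, chunks))
```

Ids that fall outside the retrieved range are ignored, so a citation to a chunk that was never supplied is itself a useful warning sign.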

ask_with_rag function

rag_pipeline.py

import anthropic

anthropic_client = anthropic.Anthropic()

def ask_with_rag(query: str, retriever: Retriever, top_k: int = 3) -> str:
    chunks = retriever.search(query, top_k=top_k)
    prompt = build_prompt(query, chunks)
    message = anthropic_client.messages.create(
        model='claude-opus-4-5',
        max_tokens=512,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return message.content[0].text
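One possible refinement is to refuse locally when retrieval returns nothing, so no model call is made on an empty context. The sketch below assumes the retriever can return an empty list, which the course's Retriever may or may not do. `ask_with_guard`, `EmptyRetriever`, and `fake_generate` are hypothetical names, and the generation step is passed in as a callable so the example runs without an API key.

```python
from typing import Callable

REFUSAL = "I don't have enough information to answer that."


class EmptyRetriever:
    """Hypothetical stand-in that never finds a matching chunk."""

    def search(self, query: str, top_k: int = 3) -> list[str]:
        return []


def ask_with_guard(
    query: str,
    retriever,  # any object with .search(query, top_k=...) -> list[str]
    generate: Callable[[str, list[str]], str],
    top_k: int = 3,
) -> str:
    chunks = retriever.search(query, top_k=top_k)
    if not chunks:
        # Nothing retrieved: refuse locally and skip the model call.
        return REFUSAL
    return generate(query, chunks)


def fake_generate(query: str, chunks: list[str]) -> str:
    # Placeholder for build_prompt plus anthropic_client.messages.create.
    return f"answered from {len(chunks)} chunk(s)"


print(ask_with_guard("What is the capital of France?", EmptyRetriever(), fake_generate))
```

In the real pipeline, `generate` would wrap the build_prompt call and the Claude request shown above.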

Example queries

rag_pipeline.py

document = """
## Attention Mechanisms
Transformers use self-attention to relate all token positions directly.
This avoids the sequential bottleneck of RNNs and enables parallelism.

## Scaling Laws
Model capability scales predictably with parameters and training compute.
The Chinchilla paper showed training requires more data than previously assumed.

## Fine-Tuning
Fine-tuning adapts a pre-trained model to a specific task with a small dataset.
LoRA reduces fine-tuning cost by training low-rank adapter matrices only.
"""

chunks = chunk_by_section(document)
retriever = Retriever(chunks)

# Query 1: well-covered topic
print(ask_with_rag('What advantage does self-attention have over RNNs?', retriever))

# Query 2: exact token
print(ask_with_rag('What does LoRA stand for?', retriever))

# Query 3: out-of-scope — expect refusal
print(ask_with_rag('What is the capital of France?', retriever))

Graceful refusal is a feature, not a bug

A RAG system that says "I don't have enough information" is far more trustworthy than one that hallucinates a plausible-sounding answer. The explicit refusal instruction in build_prompt makes this behaviour much more consistent, although no prompt can guarantee it.
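Because the refusal wording is fixed in the prompt, calling code can check for it and branch, for example to log the gap or fall back to a human. This is a minimal sketch that assumes the model repeats the sentence from build_prompt closely enough for a substring match; `is_refusal` is a hypothetical helper, and paraphrased refusals would slip past it.

```python
REFUSAL = "I don't have enough information to answer that."


def is_refusal(answer: str) -> bool:
    """Return True when the answer contains the refusal sentence from build_prompt."""
    # Normalise curly apostrophes and case before comparing.
    normalized = answer.replace("\u2019", "'").casefold()
    return REFUSAL.casefold() in normalized


for answer in (
    "I don't have enough information to answer that.",
    "Self-attention relates all token positions directly.",
):
    print(is_refusal(answer), "-", answer)
```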

Knowledge Check

The RAG system returns a hallucinated answer despite the "ONLY the information in <context>" instruction. What is the most likely cause?

Recap — what you just learned

✓The 6 RAG stages are: chunk, embed, index, retrieve, prompt, generate
✓XML tags in build_prompt clearly separate context from instructions
✓The "ONLY the information in context" rule discourages hallucination from pre-training knowledge; the refusal instruction handles gaps
✓ask_with_rag retrieves top-k chunks, builds the prompt, and calls Claude in under 10 lines
✓In production, replace the in-memory indexes with a vector database for embeddings and a search engine such as Elasticsearch for BM25 so the pipeline can scale