You built a RAG pipeline. It works in demos. Your team is impressed by the chatbot that answers questions about company policy docs.
Then a customer asks: "What's the difference between the refund policy for enterprise customers and the one for self-serve accounts?"
Your pipeline retrieves the enterprise refund policy, ignores the self-serve one entirely, and confidently presents half an answer as the whole truth.
That's not a bug. That's naive RAG working exactly as designed. It embedded a query, found the top-K most similar chunks, and generated from whatever came back. It has no concept of "this question requires two documents."
This article is the investigation into why that happens, and the engineering that fixes it. We'll follow our refund policy agent from failure to fix, through the research that quantifies the gap (94.5% versus 81.2% accuracy on HotpotQA, and nearly 40 points of improvement on harder multi-hop benchmarks), and into production code you can ship today.
In this article
- The 94.5% number
- Why naive RAG breaks
- What agentic RAG actually means
- Building the baseline
- Building agentic RAG
- The refund query, solved
- Self-correction: the closed loop
- The cost-accuracy tradeoff
- When NOT to use agentic RAG
- Migration: three stages
- Where this is heading
The 94.5% number
Before we dig into architecture, let's start with the evidence that made the industry pay attention.
The A-RAG framework (arXiv:2602.03442, February 2026) ran a controlled experiment. Same questions. Same documents. Same LLM. The only variable: whether the model controlled its own retrieval decisions.
| Benchmark | Naive RAG | Agentic RAG | Improvement |
|---|---|---|---|
| HotpotQA | 81.2% | 94.5% | +13.3 pts |
| 2WikiMultiHopQA | 50.2% | 89.7% | +39.5 pts |
| MuSiQue | 52.8% | 74.1% | +21.3 pts |
Read that middle row again. On questions that require reasoning across two Wikipedia articles, naive RAG got the right answer half the time. Agentic RAG got it right nine times out of ten. Same model, same documents, same compute budget.
The critical detail: A-RAG achieved these gains with comparable or fewer retrieved tokens than naive approaches. Better accuracy without higher retrieval costs. The gains came entirely from smarter retrieval decisions: knowing what to search for, when to search again, and when to stop.
This is the refund policy problem at scale. Every multi-hop question, every comparison query, every "tell me about X in the context of Y" is a case where one retrieval pass isn't enough. And research now shows the gap isn't marginal. It's the difference between a system that works and one that doesn't.
Why naive RAG breaks
Back to our refund policy agent. The knowledge base has three chunks:
- Chunk A (Starter): "Starter plan customers receive a full refund within 14 days of purchase."
- Chunk B (Professional): "Professional plan customers can request a prorated refund within 30 days."
- Chunk C (Enterprise): "Enterprise agreements include custom refund terms negotiated per contract."
A customer asks: "Compare the refund policies across all three pricing tiers."
Naive RAG embeds that query, runs cosine similarity, and returns the top 3 results. The query embedding lands closest to Chunk C (because "enterprise," "agreements," and "refund" are semantically dense together). Top-3 retrieval returns Chunk C, Chunk A, and a chunk about pricing features that mentions "refund" once in passing. Chunk B on the Professional tier never appears.
The LLM generates a "comparison" covering Starter and Enterprise while silently dropping Professional. The customer has no idea they're missing a third of the answer. This is the class of failure naive RAG cannot self-diagnose: it has no concept of answer completeness.
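To make the mechanism concrete, here's a toy reproduction with hand-written 3-dimensional vectors standing in for real embeddings. The numbers are invented for illustration; a real embedding model behaves analogously in 1,536 dimensions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-d vectors: axis 0 ~ "refund", axis 1 ~ "enterprise/contract", axis 2 ~ "prorated"
chunk_vectors = {
    "A_starter":      [0.9, 0.1, 0.1],
    "B_professional": [0.4, 0.1, 0.9],
    "C_enterprise":   [0.7, 0.7, 0.1],
}
# The comparison query skews toward "refund" + "enterprise" wording
query_vector = [0.8, 0.5, 0.1]

ranked = sorted(chunk_vectors, key=lambda k: cosine(query_vector, chunk_vectors[k]), reverse=True)
top_2 = ranked[:2]
print(top_2)  # ['C_enterprise', 'A_starter'] -- Professional never surfaces
```

With these vectors the ranking comes out Enterprise, Starter, Professional, so a top-2 cut drops the Professional chunk: exactly the silent omission described above, in miniature.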
A 2026 study by Maxim AI found that 40-60% of RAG implementations fail to reach production due to retrieval quality issues. PremAI's production analysis found 80% of RAG failures trace to the retrieval layer, not the LLM itself.
The failure modes are predictable:
Semantic similarity is not relevance. "How do I cancel my subscription?" is semantically close to "Our subscription plans offer flexibility." Vector search returns the wrong chunk because similar text and relevant text are different things.
Multi-hop questions need multi-step retrieval. Our refund comparison requires three documents. Top-K from a single query embedding surfaces one.
Some queries need no retrieval at all. "What's 15% of $340?" doesn't need a document search. But naive RAG retrieves anyway, polluting context with irrelevant chunks that degrade the answer.
These aren't edge cases. They're the everyday queries that production knowledge systems face. And they share a root cause: the retrieval pipeline has no intelligence.
What agentic RAG actually means
Agentic RAG is retrieval-augmented generation where the LLM participates in retrieval decisions instead of passively consuming whatever a fixed pipeline returns. The agent decides whether to retrieve, which strategy to use, whether the results are sufficient, and whether to reformulate and try again.
The agentic RAG survey (arXiv:2501.09136) identifies four core patterns:
- Reflection -- the agent evaluates its own retrieval and decides if it's enough
- Planning -- the agent decomposes complex queries into sub-tasks
- Tool use -- the agent selects between multiple retrieval strategies
- Multi-agent collaboration -- specialized agents handle different aspects of a query
For our refund policy agent, this means: instead of embedding the comparison question and hoping for the best, the agent recognizes it needs three separate pieces of information, makes three targeted retrievals, checks that it got all three, and only then generates the answer.
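That planning step amounts to fanning one question out into per-item sub-queries. The sketch below hard-codes the topic and tier list for illustration; in the real agent, the LLM performs this decomposition itself through its tool calls:

```python
def decompose_comparison(query: str, items: list[str]) -> list[str]:
    """Turn one comparison question into one targeted sub-query per item.

    The item list is assumed known here (e.g. from metadata); an LLM
    planner would infer both the topic and the items from the query.
    """
    topic = "refund policy"  # illustrative; a planner would extract this
    return [f"{topic} for the {item} tier" for item in items]

sub_queries = decompose_comparison(
    "Compare the refund policies across all three pricing tiers",
    ["Starter", "Professional", "Enterprise"],
)
# One retrieval per sub-query guarantees every tier is represented
print(sub_queries)
```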
Building the baseline
Let's build the naive pipeline first so we can see exactly where it breaks. This is a condensed version of the RAG-from-scratch tutorial.
You'll need an OpenAI API key, and either a local Qdrant instance or the in-memory store shown below. Total cost to run every example: under $0.10.
TypeScript:
```typescript
// naive-rag.ts — The fixed embed-search-generate pipeline
import OpenAI from "openai";

const openai = new OpenAI();

interface Chunk {
  id: string;
  text: string;
  embedding: number[];
  metadata: { source: string; section: string };
}

const chunks: Chunk[] = []; // populated from your knowledge base

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// This is the problem: one query, one retrieval pass, no evaluation
async function naiveRag(query: string): Promise<string> {
  const queryEmbedding = await embed(query);

  // Rank all chunks by cosine similarity — no strategy selection
  const retrieved = chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3) // top-3, regardless of whether 3 is enough
    .map((r) => r.chunk);

  const context = retrieved.map((c) => c.text).join("\n\n---\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer based ONLY on the provided context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: "user", content: query },
    ],
    temperature: 0.1,
  });

  return response.choices[0].message.content ?? "No answer generated";
}
```

Python:
```python
# naive_rag.py — The fixed embed-search-generate pipeline
import numpy as np
from openai import OpenAI

client = OpenAI()

chunks: list[dict] = []  # populated from your knowledge base

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Same problem: one query, one pass, no evaluation
def naive_rag(query: str) -> str:
    query_embedding = embed(query)
    scored = [(c, cosine_similarity(query_embedding, c["embedding"])) for c in chunks]
    scored.sort(key=lambda x: x[1], reverse=True)
    retrieved = [c for c, _ in scored[:3]]  # top-3, hope for the best

    context = "\n\n---\n\n".join(c["text"] for c in retrieved)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer based ONLY on the provided context.\n\nContext:\n{context}"},
            {"role": "user", "content": query},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content or "No answer generated"
```

This works for "What is our refund policy?" It fails for "Compare refund policies across all three tiers" because the retrieval step runs once, returns chunks about one or two tiers, and has no mechanism to realize it's missing the rest.
Building agentic RAG
The core upgrade: treat retrieval as a set of tools the agent can call, rather than a fixed pipeline step. We define multiple retrieval strategies, give them to the LLM via function calling, and let the model decide which to use and when.
Step 1: Define retrieval tools
TypeScript:
```typescript
// agentic-rag.ts — Retrieval as tools, not a pipeline
import OpenAI from "openai";
import type { ChatCompletionTool } from "openai/resources/chat/completions";

const openai = new OpenAI();

// Each tool represents a different retrieval STRATEGY
const retrievalTools: ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "semantic_search",
      // Agent sees this description and decides when to use it
      description: "Search by meaning. Best for concepts and explanations.",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "Search query optimized for semantic retrieval." },
          top_k: { type: "number", description: "Results to return. 3 focused, 5-7 broad.", default: 3 },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "keyword_search",
      // For exact terms the agent knows appear in the docs
      description: "Exact keyword matching. Best for product names, policy numbers, specific terms.",
      parameters: {
        type: "object",
        properties: {
          keywords: { type: "string", description: "Space-separated keywords." },
        },
        required: ["keywords"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "metadata_filter",
      // Surgical retrieval: go straight to a known document section
      description: "Filter by document source or section before searching.",
      parameters: {
        type: "object",
        properties: {
          source: { type: "string", description: "Document source name." },
          section: { type: "string", description: "Section within a document." },
        },
        required: [],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "no_retrieval_needed",
      // Prevents unnecessary retrieval on math/greeting queries
      description: "Use when the question needs no document search (math, greetings, clarifications).",
      parameters: {
        type: "object",
        properties: {
          reason: { type: "string", description: "Why no retrieval is needed." },
        },
        required: ["reason"],
      },
    },
  },
];
```

Python:
```python
# agentic_rag.py — Retrieval as tools, not a pipeline
from openai import OpenAI

client = OpenAI()

# Each tool = a different retrieval strategy the agent can choose
retrieval_tools = [
    {
        "type": "function",
        "function": {
            "name": "semantic_search",
            "description": "Search by meaning. Best for concepts and explanations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Query optimized for semantic retrieval."},
                    "top_k": {"type": "number", "description": "Results count. 3 focused, 5-7 broad.", "default": 3},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "keyword_search",
            "description": "Exact keyword matching. Best for product names, policy numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "keywords": {"type": "string", "description": "Space-separated keywords."},
                },
                "required": ["keywords"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "metadata_filter",
            "description": "Filter by document source or section before searching.",
            "parameters": {
                "type": "object",
                "properties": {
                    "source": {"type": "string", "description": "Document source name."},
                    "section": {"type": "string", "description": "Section within a document."},
                },
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "no_retrieval_needed",
            "description": "Use when the question needs no document search.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string", "description": "Why no retrieval is needed."},
                },
                "required": ["reason"],
            },
        },
    },
]
```

Step 2: The agentic loop
This is the core. The agent calls tools, evaluates results, and decides whether to retrieve more or answer. We cap iterations to prevent runaway loops.
TypeScript:
```typescript
// The agentic loop: plan → retrieve → evaluate → repeat or answer
import type {
  ChatCompletionMessageParam,
  ChatCompletionToolMessageParam,
} from "openai/resources/chat/completions";

const SYSTEM_PROMPT = `You are a research assistant with access to a knowledge base.

Strategy:
1. Analyze what information the question needs.
2. Choose the best retrieval tool(s) for the question type.
3. For comparison questions, search for EACH item separately.
4. After retrieval, evaluate: do you have enough context?
5. If not, reformulate and try again.
6. Once sufficient, answer directly. Never fabricate beyond retrieved docs.`;

async function agenticRag(
  query: string,
  maxIterations = 3 // hard cap prevents runaway retrieval loops
): Promise<{ answer: string; toolCalls: string[]; iterations: number }> {
  const messages: ChatCompletionMessageParam[] = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: query },
  ];
  const toolCallLog: string[] = [];
  let iterations = 0;

  while (iterations < maxIterations) {
    iterations++;
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages,
      tools: retrievalTools,
      tool_choice: "auto", // agent decides whether to call a tool or answer
      temperature: 0.1,
    });

    const message = response.choices[0].message;
    messages.push(message);

    // No tool calls = agent is ready to answer with what it has
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return { answer: message.content ?? "No answer generated", toolCalls: toolCallLog, iterations };
    }

    // Execute each tool call and feed results back into conversation
    for (const toolCall of message.tool_calls) {
      const args = JSON.parse(toolCall.function.arguments);
      const result = await executeTool(toolCall.function.name, args);
      toolCallLog.push(`${toolCall.function.name}(${JSON.stringify(args)})`);

      // Tool result becomes part of the conversation — agent sees it next iteration
      const toolMessage: ChatCompletionToolMessageParam = {
        role: "tool",
        tool_call_id: toolCall.id,
        content: result,
      };
      messages.push(toolMessage);
    }
  }

  // Safety net: force an answer if we hit the iteration cap
  const finalResponse = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      ...messages,
      { role: "user", content: "Based on all information retrieved, provide your best answer now." },
    ],
    temperature: 0.1,
  });

  return {
    answer: finalResponse.choices[0].message.content ?? "No answer generated",
    toolCalls: toolCallLog,
    iterations,
  };
}
```

Python:
```python
# The agentic loop: plan → retrieve → evaluate → repeat or answer
import json

SYSTEM_PROMPT = """You are a research assistant with access to a knowledge base.

Strategy:
1. Analyze what information the question needs.
2. Choose the best retrieval tool(s) for the question type.
3. For comparison questions, search for EACH item separately.
4. After retrieval, evaluate: do you have enough context?
5. If not, reformulate and try again.
6. Once sufficient, answer directly. Never fabricate beyond retrieved docs."""

def agentic_rag(query: str, max_iterations: int = 3) -> dict:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
    tool_call_log = []
    iterations = 0

    while iterations < max_iterations:
        iterations += 1
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=retrieval_tools,
            tool_choice="auto",  # agent decides: tool call or answer
            temperature=0.1,
        )
        message = response.choices[0].message
        messages.append(message)

        # No tool calls = agent is ready to answer
        if not message.tool_calls:
            return {"answer": message.content or "No answer", "tool_calls": tool_call_log, "iterations": iterations}

        # Execute tools, feed results back into conversation history
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_call.function.name, args)
            tool_call_log.append(f"{tool_call.function.name}({json.dumps(args)})")
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})

    # Safety net: force answer at iteration cap
    messages.append({"role": "user", "content": "Based on all information retrieved, provide your best answer now."})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0.1)
    return {"answer": final.choices[0].message.content or "No answer", "tool_calls": tool_call_log, "iterations": iterations}
```

The refund query, solved
Now let's trace the exact query that broke naive RAG: "Compare the refund policies across our three pricing tiers."
The agent makes three separate retrieval calls -- one per tier -- using metadata filters. It recognizes this is a comparison question and acts accordingly. Naive RAG would have made one semantic search and returned whichever tier's refund policy happened to be closest to the query embedding.
For a simple lookup like "What is the enterprise refund policy?", the agent calls semantic_search once and answers in one iteration. Same cost as naive RAG. The intelligence only adds cost when it needs to.
For a multi-hop question like "Which pricing tier includes the features from our latest product announcement?", the agent chains: keyword search for the announcement, semantic search for pricing tiers that mention those features, metadata filter to fill in gaps. Three iterations, three different strategies, one complete answer.
Self-correction: the closed loop
The self-correction evaluator is what separates basic function-calling from true agentic RAG. After each retrieval, the agent explicitly assesses whether results are sufficient.
TypeScript:
```typescript
// Evaluation tool — agent calls this to assess its own retrieval quality
const evaluationTool: ChatCompletionTool = {
  type: "function",
  function: {
    name: "evaluate_retrieval",
    description: "Assess whether retrieved context is sufficient. Call after each retrieval.",
    parameters: {
      type: "object",
      properties: {
        retrieved_context_summary: { type: "string", description: "What was retrieved so far." },
        missing_information: { type: "string", description: "What's still needed. 'none' if sufficient." },
        confidence: { type: "number", description: "0-1 confidence the context can answer the question." },
        // This is the key decision: does the agent search more or answer now?
        next_action: {
          type: "string",
          enum: ["answer_now", "search_more", "refine_query"],
          description: "What to do next.",
        },
      },
      required: ["retrieved_context_summary", "missing_information", "confidence", "next_action"],
    },
  },
};

// When the agent calls evaluate_retrieval, parse its self-assessment
function handleEvaluation(args: {
  confidence: number;
  next_action: "answer_now" | "search_more" | "refine_query";
  missing_information: string;
}): { shouldContinue: boolean; feedback: string } {
  console.log(`[Eval] confidence=${args.confidence} action=${args.next_action}`);

  // High confidence + answer_now = stop retrieving
  if (args.confidence >= 0.8 && args.next_action === "answer_now") {
    return { shouldContinue: false, feedback: "Context sufficient. Generate your answer." };
  }

  // Agent identified a gap — tell it to target that gap specifically
  if (args.next_action === "refine_query") {
    return { shouldContinue: true, feedback: `Missing: ${args.missing_information}. Reformulate to target this gap.` };
  }

  return { shouldContinue: true, feedback: `Confidence: ${args.confidence}. Try a different strategy.` };
}
```

Python:
```python
# When the agent calls evaluate_retrieval, parse its self-assessment
def handle_evaluation(args: dict) -> tuple[bool, str]:
    """Returns (should_continue, feedback_to_agent)."""
    confidence = args["confidence"]
    action = args["next_action"]
    missing = args["missing_information"]
    print(f"[Eval] confidence={confidence} action={action}")

    # High confidence = stop retrieving, generate answer
    if confidence >= 0.8 and action == "answer_now":
        return False, "Context sufficient. Generate your answer."

    # Agent found a gap — direct it to fill that specific gap
    if action == "refine_query":
        return True, f"Missing: {missing}. Reformulate to target this gap."

    return True, f"Confidence: {confidence}. Try a different retrieval strategy."
```

Add this tool to your retrievalTools array. When the agent calls evaluate_retrieval, feed the feedback string back as the tool response. The agent sees the evaluation in its conversation history and adjusts its next retrieval. Retrieve, evaluate, refine, repeat. A closed loop.
For our refund policy agent, this is the moment it catches itself. After retrieving Starter and Enterprise policies, the evaluation fires: confidence=0.6, missing_information="Professional tier refund policy", next_action="search_more". The agent makes one more targeted retrieval and gets the complete picture.
Iterative retrieval with self-correction has been shown to yield gains of up to 25.6 percentage points over single-pass retrieval on multi-hop scientific QA tasks, specifically by catching late-hop failures and correcting hypothesis drift.
The cost-accuracy tradeoff
Agentic RAG costs more per query. Here's exactly where the extra tokens go, and why the math still works.
| Metric | Naive RAG | Agentic RAG |
|---|---|---|
| Cost per query | $0.002 | $0.007 |
| Daily cost (10K queries) | $20 | $70 |
| Monthly cost | $600 | $2,100 |
| Accuracy (simple queries, ~65% of traffic) | 85% | 88% |
| Accuracy (multi-hop queries, ~35% of traffic) | 52% | 87% |
| Weighted accuracy | 73.5% | 87.7% |
| Wrong answers per day | 2,650 | 1,230 |
The daily compute increase is $50. But 1,420 fewer wrong answers per day means 1,420 fewer potential escalations. If each human escalation costs $5-15, the daily savings ($7,100-$21,300) dwarf the compute cost by two orders of magnitude.
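The weighted-accuracy and savings figures above are plain arithmetic over the assumed traffic mix; the snippet below reproduces them (the table rounds the wrong-answer counts to the nearest ten):

```python
QUERIES_PER_DAY = 10_000
SIMPLE_SHARE, MULTI_HOP_SHARE = 0.65, 0.35  # traffic mix from the table

def weighted_accuracy(simple_acc: float, multi_hop_acc: float) -> float:
    return SIMPLE_SHARE * simple_acc + MULTI_HOP_SHARE * multi_hop_acc

naive = weighted_accuracy(0.85, 0.52)    # 0.7345, i.e. 73.5%
agentic = weighted_accuracy(0.88, 0.87)  # 0.8765, i.e. 87.7%

wrong_naive = round(QUERIES_PER_DAY * (1 - naive))      # ~2,655 (table: 2,650)
wrong_agentic = round(QUERIES_PER_DAY * (1 - agentic))  # ~1,235 (table: 1,230)
saved = wrong_naive - wrong_agentic                     # ~1,420 fewer wrong answers per day

# At $5-$15 per human escalation, daily savings dwarf the $50 compute increase
print(saved * 5, saved * 15)
```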
The key optimization: route queries to the right pipeline. A lightweight classifier sends simple queries through naive RAG at $0.002 each and complex queries through agentic RAG at $0.007. With the traffic mix above, the blended cost is roughly $0.004 per query: most of the accuracy improvement at about half the full agentic cost.
When NOT to use agentic RAG
Agentic RAG adds latency (2-6 seconds vs. 0.5-3 seconds), cost (3.5x), and complexity. Sometimes that tradeoff is a bad deal.
Single-document lookups. If 90%+ of queries are "What is X?" and the answer lives in one chunk, the agent's planning step adds latency for zero accuracy gain.
Low-stakes internal tools. An employee FAQ bot where wrong answers get corrected casually doesn't justify the compute cost. Save agentic RAG for customer-facing systems where wrong answers have consequences.
Small, uniform knowledge bases. Fifty pages of product docs in a consistent format? Top-K with good chunking covers it. Agentic RAG shines when documents vary in structure, authority, and scope.
Latency-critical paths. Voice agents with sub-2-second requirements can't afford 3-5 seconds of agentic retrieval. Use naive RAG with a reranker (+200ms, +10-30% precision) as a middle ground.
The decision framework: measure your multi-hop failure rate. Below 15%, naive RAG is fine. Between 15% and 30%, add a reranker. Above 30%, build the agentic layer.
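That rule of thumb is small enough to encode directly. The thresholds below come straight from the paragraph above and should be tuned against your own evaluation set and escalation costs:

```python
def choose_architecture(multi_hop_failure_rate: float) -> str:
    """Pick a RAG architecture from the measured multi-hop failure rate.

    Thresholds are the rule-of-thumb values from the text, not universal
    constants; recalibrate them on your own traffic.
    """
    if multi_hop_failure_rate < 0.15:
        return "naive_rag"
    if multi_hop_failure_rate <= 0.30:
        return "naive_rag_plus_reranker"
    return "agentic_rag"

print(choose_architecture(0.10))  # naive_rag
print(choose_architecture(0.22))  # naive_rag_plus_reranker
print(choose_architecture(0.45))  # agentic_rag
```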
Migration: three stages
You don't rebuild from scratch. Upgrade incrementally, measuring at each stage.
Stage 1: Add a reranker (1 day). Keep naive RAG. After top-K retrieval, add a cross-encoder reranker to re-score chunks. Catches "semantically similar but irrelevant" failures. Expected: +10-30% retrieval precision. Latency: +200ms.
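Here's a sketch of that reranking step, with the cross-encoder abstracted behind a score callback so the example stays self-contained. A real deployment would plug in an actual cross-encoder model's scoring function; the keyword-overlap scorer below is only a stand-in:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 3) -> list[str]:
    """Re-score first-pass candidates with a (query, passage) relevance model."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:keep]

# Stand-in scorer: keyword overlap. A real cross-encoder reads the full
# (query, passage) pair and scores actual relevance, not surface similarity.
def overlap_score(query: str, passage: str) -> float:
    tokens = lambda s: {w.strip(".,!?") for w in s.lower().split()}
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q)

docs = [  # imagine these came back from top-K vector search
    "Our subscription plans offer flexibility.",
    "To cancel your subscription, open Billing and choose Cancel.",
    "Pricing tiers include Starter, Professional, and Enterprise.",
]
print(rerank("how do I cancel my subscription", docs, overlap_score, keep=1))
```

Because `rerank` only depends on the callback's signature, swapping the toy scorer for a real model is a one-line change.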
Stage 2: Add query routing (2-3 days). Classify each query as simple, multi-hop, or no-retrieval. Route accordingly.
```typescript
// Stage 2: Query router — cheapest path to agentic gains
async function hybridRag(query: string): Promise<string> {
  // One cheap LLM call to classify the query type
  const classification = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify this query into one category:
- "simple": single fact, one document
- "multi_hop": needs 2+ documents or comparison
- "no_retrieval": answerable from general knowledge
Respond with just the category name.`,
      },
      { role: "user", content: query },
    ],
    temperature: 0,
    max_tokens: 20,
  });

  const queryType = classification.choices[0].message.content?.trim() ?? "simple";

  // Only complex queries pay the agentic cost
  switch (queryType) {
    case "no_retrieval": return directLlmAnswer(query);
    case "multi_hop": return (await agenticRag(query)).answer;
    default: return naiveRag(query);
  }
}
```

Stage 3: Add self-correction (1-2 days). Add the evaluate_retrieval tool to your agentic loop. Now the agent catches its own gaps. This is the full pattern from this article.
At each stage, run your evaluation set. If Stage 1 gets you above 85% on multi-hop queries, you may not need Stage 3. Let the numbers decide.
Where this is heading
The A-RAG paper revealed something important: agentic retrieval performance scales with model capability. When they upgraded from GPT-4o-mini to GPT-5-mini, accuracy on 2WikiMultiHopQA jumped from 60.2% to 89.7%. The architecture didn't change. The agent just made better retrieval decisions.
This means the pipeline you build today gets better as models improve, without architectural changes. The tools stay the same. The loop stays the same. The decisions get sharper.
Back to our refund policy agent: it started by confidently delivering half an answer. Now it decomposes the question, retrieves all three policies, evaluates its own completeness, and delivers the full comparison. Same documents. Same LLM. The only difference is who controls the retrieval decisions.
That's the trajectory. RAG started as a pipeline. It's becoming a team.
Ready to build? Start with the naive RAG baseline from the RAG-from-scratch tutorial. When your evaluation shows multi-hop accuracy below 70%, come back here and add the agentic layer. If you're building agents that need retrieval alongside tools and memory, the combination of agentic RAG with MCP tool execution is where production systems are heading.
Build agents with knowledge, memory, and tools
Chanl connects your AI agents to knowledge bases, persistent memory, and external tools through MCP. Build the retrieval layer once, monitor it in production, and improve it with real conversation data.