You built a RAG pipeline. It works in demos. Your team is impressed by the chatbot that answers questions about company policy docs.
Then a customer asks: "What's the difference between the refund policy for enterprise customers and the one for self-serve accounts?"
Your pipeline retrieves the enterprise refund policy, ignores the self-serve one entirely, and confidently presents half an answer as the whole truth.
That's not a bug. That's naive RAG working exactly as designed. It embedded a query, found the top-K most similar chunks, and generated from whatever came back. It has no concept of "this question requires two documents."
This article is the investigation into why that happens, and the engineering that fixes it. We'll follow our refund policy agent from failure to fix, through the research that quantifies the gap (94.5% versus 81.2% accuracy on HotpotQA, and nearly 40 points of improvement on harder multi-hop benchmarks), and into production code you can ship today.
In this article
- The 94.5% number
- Why naive RAG breaks
- What agentic RAG actually means
- Building the baseline
- Building agentic RAG
- The refund query, solved
- Self-correction: the closed loop
- The cost-accuracy tradeoff
- When NOT to use agentic RAG
- Migration: three stages
- Where this is heading
The 94.5% number
Before we dig into architecture, let's start with the evidence that made the industry pay attention.
The A-RAG framework (arXiv:2602.03442, February 2026) ran a controlled experiment. Same questions. Same documents. Same LLM. The only variable: whether the model controlled its own retrieval decisions.
| Benchmark | Naive RAG | Agentic RAG | Improvement |
|---|---|---|---|
| HotpotQA | 81.2% | 94.5% | +13.3 pts |
| 2WikiMultiHopQA | 50.2% | 89.7% | +39.5 pts |
| MuSiQue | 52.8% | 74.1% | +21.3 pts |
Read that middle row again. On questions that require reasoning across two Wikipedia articles, naive RAG got the right answer half the time. Agentic RAG got it right nine times out of ten. Same model, same documents, same compute budget.
The critical detail: A-RAG achieved these gains with comparable or fewer retrieved tokens than naive approaches. Better accuracy without higher retrieval costs. The gains came entirely from smarter retrieval decisions: knowing what to search for, when to search again, and when to stop.
This is the refund policy problem at scale. Every multi-hop question, every comparison query, every "tell me about X in the context of Y" is a case where one retrieval pass isn't enough. And research now shows the gap isn't marginal. It's the difference between a system that works and one that doesn't.
Why naive RAG breaks
Back to our refund policy agent. The knowledge base has three chunks:
- Chunk A (Starter): "Starter plan customers receive a full refund within 14 days of purchase."
- Chunk B (Professional): "Professional plan customers can request a prorated refund within 30 days."
- Chunk C (Enterprise): "Enterprise agreements include custom refund terms negotiated per contract."
A customer asks: "Compare the refund policies across all three pricing tiers."
Naive RAG embeds that query, runs cosine similarity, and returns the top 3 results. The query embedding lands closest to Chunk C (because "enterprise," "agreements," and "refund" are semantically dense together). Top-3 retrieval returns Chunk C, Chunk A, and a chunk about pricing features that mentions "refund" once in passing. Chunk B on the Professional tier never appears.
The LLM generates a "comparison" covering Starter and Enterprise while silently dropping Professional. The customer has no idea they're missing a third of the answer. This is the class of failure naive RAG cannot self-diagnose: it has no concept of answer completeness.
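To make the mechanism concrete, here's a toy reproduction with hand-written 3-dimensional vectors standing in for real embeddings. The numbers are invented for illustration; a real embedding model behaves analogously in 1,536 dimensions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-d vectors: axis 0 ~ "refund", axis 1 ~ "enterprise/contract", axis 2 ~ "prorated"
chunk_vectors = {
    "A_starter":      [0.9, 0.1, 0.1],
    "B_professional": [0.4, 0.1, 0.9],
    "C_enterprise":   [0.7, 0.7, 0.1],
}
# The comparison query skews toward "refund" + "enterprise" wording
query_vector = [0.8, 0.5, 0.1]

ranked = sorted(chunk_vectors, key=lambda k: cosine(query_vector, chunk_vectors[k]), reverse=True)
top_2 = ranked[:2]
print(top_2)  # ['C_enterprise', 'A_starter'] -- Professional never surfaces
```

With these vectors the ranking comes out Enterprise, Starter, Professional, so a top-2 cut drops the Professional chunk: exactly the silent omission described above, in miniature.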
A 2026 study by Maxim AI found that 40-60% of RAG implementations fail to reach production due to retrieval quality issues. PremAI's production analysis found 80% of RAG failures trace to the retrieval layer, not the LLM itself.
The failure modes are predictable:
Semantic similarity is not relevance. "How do I cancel my subscription?" is semantically close to "Our subscription plans offer flexibility." Vector search returns the wrong chunk because similar text and relevant text are different things.
Multi-hop questions need multi-step retrieval. Our refund comparison requires three documents. Top-K from a single query embedding surfaces one.
Some queries need no retrieval at all. "What's 15% of $340?" doesn't need a document search. But naive RAG retrieves anyway, polluting context with irrelevant chunks that degrade the answer.
These aren't edge cases. They're the everyday queries that production knowledge systems face. And they share a root cause: the retrieval pipeline has no intelligence.
What agentic RAG actually means
Agentic RAG is retrieval-augmented generation where the LLM participates in retrieval decisions instead of passively consuming whatever a fixed pipeline returns. The agent decides whether to retrieve, which strategy to use, whether the results are sufficient, and whether to reformulate and try again.
The agentic RAG survey (arXiv:2501.09136) identifies four core patterns:
- Reflection -- the agent evaluates its own retrieval and decides if it's enough
- Planning -- the agent decomposes complex queries into sub-tasks
- Tool use -- the agent selects between multiple retrieval strategies
- Multi-agent collaboration -- specialized agents handle different aspects of a query
For our refund policy agent, this means: instead of embedding the comparison question and hoping for the best, the agent recognizes it needs three separate pieces of information, makes three targeted retrievals, checks that it got all three, and only then generates the answer.
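That planning step amounts to fanning one question out into per-item sub-queries. The sketch below hard-codes the topic and tier list for illustration; in the real agent, the LLM performs this decomposition itself through its tool calls:

```python
def decompose_comparison(query: str, items: list[str]) -> list[str]:
    """Turn one comparison question into one targeted sub-query per item.

    The item list is assumed known here (e.g. from metadata); an LLM
    planner would infer both the topic and the items from the query.
    """
    topic = "refund policy"  # illustrative; a planner would extract this
    return [f"{topic} for the {item} tier" for item in items]

sub_queries = decompose_comparison(
    "Compare the refund policies across all three pricing tiers",
    ["Starter", "Professional", "Enterprise"],
)
# One retrieval per sub-query guarantees every tier is represented
print(sub_queries)
```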
Building the baseline
Let's build the naive pipeline first so we can see exactly where it breaks. This is a condensed version of the RAG-from-scratch tutorial.
You'll need an OpenAI API key, and either a local Qdrant instance or the in-memory store shown below. Total cost to run every example: under $0.10.
TypeScript:
```typescript
// naive-rag.ts — The fixed embed-search-generate pipeline
import OpenAI from "openai";

const openai = new OpenAI();

interface Chunk {
  id: string;
  text: string;
  embedding: number[];
  metadata: { source: string; section: string };
}

const chunks: Chunk[] = []; // populated from your knowledge base

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// This is the problem: one query, one retrieval pass, no evaluation
async function naiveRag(query: string): Promise<string> {
  const queryEmbedding = await embed(query);

  // Rank all chunks by cosine similarity — no strategy selection
  const retrieved = chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3) // top-3, regardless of whether 3 is enough
    .map((r) => r.chunk);

  const context = retrieved.map((c) => c.text).join("\n\n---\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer based ONLY on the provided context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: "user", content: query },
    ],
    temperature: 0.1,
  });

  return response.choices[0].message.content ?? "No answer generated";
}
```

Python:
```python
# naive_rag.py — The fixed embed-search-generate pipeline
import numpy as np
from openai import OpenAI

client = OpenAI()

chunks: list[dict] = []  # populated from your knowledge base

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Same problem: one query, one pass, no evaluation
def naive_rag(query: str) -> str:
    query_embedding = embed(query)
    scored = [(c, cosine_similarity(query_embedding, c["embedding"])) for c in chunks]
    scored.sort(key=lambda x: x[1], reverse=True)
    retrieved = [c for c, _ in scored[:3]]  # top-3, hope for the best

    context = "\n\n---\n\n".join(c["text"] for c in retrieved)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer based ONLY on the provided context.\n\nContext:\n{context}"},
            {"role": "user", "content": query},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content or "No answer generated"
```

This works for "What is our refund policy?" It fails for "Compare refund policies across all three tiers" because the retrieval step runs once, returns chunks about one or two tiers, and has no mechanism to realize it's missing the rest.
Building agentic RAG
The core upgrade: treat retrieval as a set of tools the agent can call, rather than a fixed pipeline step. We define multiple retrieval strategies, give them to the LLM via function calling, and let the model decide which to use and when.
Step 1: Define retrieval tools
TypeScript:
```typescript
// agentic-rag.ts — Retrieval as tools, not a pipeline
import OpenAI from "openai";
import type { ChatCompletionTool } from "openai/resources/chat/completions";

const openai = new OpenAI();

// Each tool represents a different retrieval STRATEGY
const retrievalTools: ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "semantic_search",
      // Agent sees this description and decides when to use it
      description: "Search by meaning. Best for concepts and explanations.",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "Search query optimized for semantic retrieval." },
          top_k: { type: "number", description: "Results to return. 3 focused, 5-7 broad.", default: 3 },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "keyword_search",
      // For exact terms the agent knows appear in the docs
      description: "Exact keyword matching. Best for product names, policy numbers, specific terms.",
      parameters: {
        type: "object",
        properties: {
          keywords: { type: "string", description: "Space-separated keywords." },
        },
        required: ["keywords"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "metadata_filter",
      // Surgical retrieval: go straight to a known document section
      description: "Filter by document source or section before searching.",
      parameters: {
        type: "object",
        properties: {
          source: { type: "string", description: "Document source name." },
          section: { type: "string", description: "Section within a document." },
        },
        required: [],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "no_retrieval_needed",
      // Prevents unnecessary retrieval on math/greeting queries
      description: "Use when the question needs no document search (math, greetings, clarifications).",
      parameters: {
        type: "object",
        properties: {
          reason: { type: "string", description: "Why no retrieval is needed." },
        },
        required: ["reason"],
      },
    },
  },
];
```

Python:
```python
# agentic_rag.py — Retrieval as tools, not a pipeline
from openai import OpenAI

client = OpenAI()

# Each tool = a different retrieval strategy the agent can choose
retrieval_tools = [
    {
        "type": "function",
        "function": {
            "name": "semantic_search",
            "description": "Search by meaning. Best for concepts and explanations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Query optimized for semantic retrieval."},
                    "top_k": {"type": "number", "description": "Results count. 3 focused, 5-7 broad.", "default": 3},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "keyword_search",
            "description": "Exact keyword matching. Best for product names, policy numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "keywords": {"type": "string", "description": "Space-separated keywords."},
                },
                "required": ["keywords"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "metadata_filter",
            "description": "Filter by document source or section before searching.",
            "parameters": {
                "type": "object",
                "properties": {
                    "source": {"type": "string", "description": "Document source name."},
                    "section": {"type": "string", "description": "Section within a document."},
                },
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "no_retrieval_needed",
            "description": "Use when the question needs no document search.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string", "description": "Why no retrieval is needed."},
                },
                "required": ["reason"],
            },
        },
    },
]
```

Step 2: The agentic loop
This is the core. The agent calls tools, evaluates results, and decides whether to retrieve more or answer. We cap iterations to prevent runaway loops.
TypeScript:
```typescript
// The agentic loop: plan → retrieve → evaluate → repeat or answer
import type {
  ChatCompletionMessageParam,
  ChatCompletionToolMessageParam,
} from "openai/resources/chat/completions";

const SYSTEM_PROMPT = `You are a research assistant with access to a knowledge base.

Strategy:
1. Analyze what information the question needs.
2. Choose the best retrieval tool(s) for the question type.
3. For comparison questions, search for EACH item separately.
4. After retrieval, evaluate: do you have enough context?
5. If not, reformulate and try again.
6. Once sufficient, answer directly. Never fabricate beyond retrieved docs.`;

async function agenticRag(
  query: string,
  maxIterations = 3 // hard cap prevents runaway retrieval loops
): Promise<{ answer: string; toolCalls: string[]; iterations: number }> {
  const messages: ChatCompletionMessageParam[] = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: query },
  ];
  const toolCallLog: string[] = [];
  let iterations = 0;

  while (iterations < maxIterations) {
    iterations++;
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages,
      tools: retrievalTools,
      tool_choice: "auto", // agent decides whether to call a tool or answer
      temperature: 0.1,
    });

    const message = response.choices[0].message;
    messages.push(message);

    // No tool calls = agent is ready to answer with what it has
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return { answer: message.content ?? "No answer generated", toolCalls: toolCallLog, iterations };
    }

    // Execute each tool call and feed results back into conversation
    for (const toolCall of message.tool_calls) {
      const args = JSON.parse(toolCall.function.arguments);
      const result = await executeTool(toolCall.function.name, args);
      toolCallLog.push(`${toolCall.function.name}(${JSON.stringify(args)})`);

      // Tool result becomes part of the conversation — agent sees it next iteration
      const toolMessage: ChatCompletionToolMessageParam = {
        role: "tool",
        tool_call_id: toolCall.id,
        content: result,
      };
      messages.push(toolMessage);
    }
  }

  // Safety net: force an answer if we hit the iteration cap
  const finalResponse = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      ...messages,
      { role: "user", content: "Based on all information retrieved, provide your best answer now." },
    ],
    temperature: 0.1,
  });

  return {
    answer: finalResponse.choices[0].message.content ?? "No answer generated",
    toolCalls: toolCallLog,
    iterations,
  };
}
```

Python:
```python
# The agentic loop: plan → retrieve → evaluate → repeat or answer
import json

SYSTEM_PROMPT = """You are a research assistant with access to a knowledge base.

Strategy:
1. Analyze what information the question needs.
2. Choose the best retrieval tool(s) for the question type.
3. For comparison questions, search for EACH item separately.
4. After retrieval, evaluate: do you have enough context?
5. If not, reformulate and try again.
6. Once sufficient, answer directly. Never fabricate beyond retrieved docs."""

def agentic_rag(query: str, max_iterations: int = 3) -> dict:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
    tool_call_log = []
    iterations = 0

    while iterations < max_iterations:
        iterations += 1
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=retrieval_tools,
            tool_choice="auto",  # agent decides: tool call or answer
            temperature=0.1,
        )
        message = response.choices[0].message
        messages.append(message)

        # No tool calls = agent is ready to answer
        if not message.tool_calls:
            return {"answer": message.content or "No answer", "tool_calls": tool_call_log, "iterations": iterations}

        # Execute tools, feed results back into conversation history
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_call.function.name, args)
            tool_call_log.append(f"{tool_call.function.name}({json.dumps(args)})")
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})

    # Safety net: force answer at iteration cap
    messages.append({"role": "user", "content": "Based on all information retrieved, provide your best answer now."})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0.1)
    return {"answer": final.choices[0].message.content or "No answer", "tool_calls": tool_call_log, "iterations": iterations}
```

The refund query, solved
Now let's trace the exact query that broke naive RAG: "Compare the refund policies across our three pricing tiers."
The agent makes three separate retrieval calls -- one per tier -- using metadata filters. It recognizes this is a comparison question and acts accordingly. Naive RAG would have made one semantic search and returned whichever tier's refund policy happened to be closest to the query embedding.
For a simple lookup like "What is the enterprise refund policy?", the agent calls semantic_search once and answers in one iteration. Same cost as naive RAG. The intelligence only adds cost when it needs to.
For a multi-hop question like "Which pricing tier includes the features from our latest product announcement?", the agent chains: keyword search for the announcement, semantic search for pricing tiers that mention those features, metadata filter to fill in gaps. Three iterations, three different strategies, one complete answer.
Self-correction: the closed loop
The self-correction evaluator is what separates basic function-calling from true agentic RAG. After each retrieval, the agent explicitly assesses whether results are sufficient.
TypeScript:
```typescript
// Evaluation tool — agent calls this to assess its own retrieval quality
const evaluationTool: ChatCompletionTool = {
  type: "function",
  function: {
    name: "evaluate_retrieval",
    description: "Assess whether retrieved context is sufficient. Call after each retrieval.",
    parameters: {
      type: "object",
      properties: {
        retrieved_context_summary: { type: "string", description: "What was retrieved so far." },
        missing_information: { type: "string", description: "What's still needed. 'none' if sufficient." },
        confidence: { type: "number", description: "0-1 confidence the context can answer the question." },
        // This is the key decision: does the agent search more or answer now?
        next_action: {
          type: "string",
          enum: ["answer_now", "search_more", "refine_query"],
          description: "What to do next.",
        },
      },
      required: ["retrieved_context_summary", "missing_information", "confidence", "next_action"],
    },
  },
};

// When the agent calls evaluate_retrieval, parse its self-assessment
function handleEvaluation(args: {
  confidence: number;
  next_action: "answer_now" | "search_more" | "refine_query";
  missing_information: string;
}): { shouldContinue: boolean; feedback: string } {
  console.log(`[Eval] confidence=${args.confidence} action=${args.next_action}`);

  // High confidence + answer_now = stop retrieving
  if (args.confidence >= 0.8 && args.next_action === "answer_now") {
    return { shouldContinue: false, feedback: "Context sufficient. Generate your answer." };
  }

  // Agent identified a gap — tell it to target that gap specifically
  if (args.next_action === "refine_query") {
    return { shouldContinue: true, feedback: `Missing: ${args.missing_information}. Reformulate to target this gap.` };
  }

  return { shouldContinue: true, feedback: `Confidence: ${args.confidence}. Try a different strategy.` };
}
```

Python:
```python
# When the agent calls evaluate_retrieval, parse its self-assessment
def handle_evaluation(args: dict) -> tuple[bool, str]:
    """Returns (should_continue, feedback_to_agent)."""
    confidence = args["confidence"]
    action = args["next_action"]
    missing = args["missing_information"]
    print(f"[Eval] confidence={confidence} action={action}")

    # High confidence = stop retrieving, generate answer
    if confidence >= 0.8 and action == "answer_now":
        return False, "Context sufficient. Generate your answer."

    # Agent found a gap — direct it to fill that specific gap
    if action == "refine_query":
        return True, f"Missing: {missing}. Reformulate to target this gap."

    return True, f"Confidence: {confidence}. Try a different retrieval strategy."
```

Add this tool to your retrievalTools array. When the agent calls evaluate_retrieval, feed the feedback string back as the tool response. The agent sees the evaluation in its conversation history and adjusts its next retrieval. Retrieve, evaluate, refine, repeat. A closed loop.
For our refund policy agent, this is the moment it catches itself. After retrieving Starter and Enterprise policies, the evaluation fires: confidence=0.6, missing_information="Professional tier refund policy", next_action="search_more". The agent makes one more targeted retrieval and gets the complete picture.
Iterative retrieval with self-correction has been shown to yield gains of up to 25.6 percentage points over single-pass retrieval on multi-hop scientific QA tasks, specifically by catching late-hop failures and correcting hypothesis drift.
The cost-accuracy tradeoff
Agentic RAG costs more per query. Here's exactly where the extra tokens go, and why the math still works.
| Metric | Naive RAG | Agentic RAG |
|---|---|---|
| Cost per query | $0.002 | $0.007 |
| Daily cost (10K queries) | $20 | $70 |
| Monthly cost | $600 | $2,100 |
| Accuracy (simple queries, ~65% of traffic) | 85% | 88% |
| Accuracy (multi-hop queries, ~35% of traffic) | 52% | 87% |
| Weighted accuracy | 73.5% | 87.7% |
| Wrong answers per day | 2,650 | 1,230 |
The daily compute increase is $50. But 1,420 fewer wrong answers per day means 1,420 fewer potential escalations. If each human escalation costs $5-15, the daily savings ($7,100-$21,300) dwarf the compute cost by two orders of magnitude.
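The weighted-accuracy and savings figures above are plain arithmetic over the assumed traffic mix; the snippet below reproduces them (the table rounds the wrong-answer counts to the nearest ten):

```python
QUERIES_PER_DAY = 10_000
SIMPLE_SHARE, MULTI_HOP_SHARE = 0.65, 0.35  # traffic mix from the table

def weighted_accuracy(simple_acc: float, multi_hop_acc: float) -> float:
    return SIMPLE_SHARE * simple_acc + MULTI_HOP_SHARE * multi_hop_acc

naive = weighted_accuracy(0.85, 0.52)    # 0.7345, i.e. 73.5%
agentic = weighted_accuracy(0.88, 0.87)  # 0.8765, i.e. 87.7%

wrong_naive = round(QUERIES_PER_DAY * (1 - naive))      # ~2,655 (table: 2,650)
wrong_agentic = round(QUERIES_PER_DAY * (1 - agentic))  # ~1,235 (table: 1,230)
saved = wrong_naive - wrong_agentic                     # ~1,420 fewer wrong answers per day

# At $5-$15 per human escalation, daily savings dwarf the $50 compute increase
print(saved * 5, saved * 15)
```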
The key optimization: route queries to the right pipeline. A lightweight classifier sends simple queries through naive RAG at $0.002 each and complex queries through agentic RAG at $0.007. With the traffic mix above, the blended cost is roughly $0.004 per query: most of the accuracy improvement at about half the full agentic cost.
When NOT to use agentic RAG
Agentic RAG adds latency (2-6 seconds vs. 0.5-3 seconds), cost (3.5x), and complexity. Sometimes that tradeoff is a bad deal.
Single-document lookups. If 90%+ of queries are "What is X?" and the answer lives in one chunk, the agent's planning step adds latency for zero accuracy gain.
Low-stakes internal tools. An employee FAQ bot where wrong answers get corrected casually doesn't justify the compute cost. Save agentic RAG for customer-facing systems where wrong answers have consequences.
Small, uniform knowledge bases. Fifty pages of product docs in a consistent format? Top-K with good chunking covers it. Agentic RAG shines when documents vary in structure, authority, and scope.
Latency-critical paths. Voice agents with sub-2-second requirements can't afford 3-5 seconds of agentic retrieval. Use naive RAG with a reranker (+200ms, +10-30% precision) as a middle ground.
The decision framework: measure your multi-hop failure rate. Below 15%, naive RAG is fine. Between 15% and 30%, add a reranker. Above 30%, build the agentic layer.
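That rule of thumb is small enough to encode directly. The thresholds below come straight from the paragraph above and should be tuned against your own evaluation set and escalation costs:

```python
def choose_architecture(multi_hop_failure_rate: float) -> str:
    """Pick a RAG architecture from the measured multi-hop failure rate.

    Thresholds are the rule-of-thumb values from the text, not universal
    constants; recalibrate them on your own traffic.
    """
    if multi_hop_failure_rate < 0.15:
        return "naive_rag"
    if multi_hop_failure_rate <= 0.30:
        return "naive_rag_plus_reranker"
    return "agentic_rag"

print(choose_architecture(0.10))  # naive_rag
print(choose_architecture(0.22))  # naive_rag_plus_reranker
print(choose_architecture(0.45))  # agentic_rag
```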
Migration: three stages
You don't rebuild from scratch. Upgrade incrementally, measuring at each stage.
Stage 1: Add a reranker (1 day). Keep naive RAG. After top-K retrieval, add a cross-encoder reranker to re-score chunks. Catches "semantically similar but irrelevant" failures. Expected: +10-30% retrieval precision. Latency: +200ms.
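Here's a sketch of that reranking step, with the cross-encoder abstracted behind a score callback so the example stays self-contained. A real deployment would plug in an actual cross-encoder model's scoring function; the keyword-overlap scorer below is only a stand-in:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 3) -> list[str]:
    """Re-score first-pass candidates with a (query, passage) relevance model."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:keep]

# Stand-in scorer: keyword overlap. A real cross-encoder reads the full
# (query, passage) pair and scores actual relevance, not surface similarity.
def overlap_score(query: str, passage: str) -> float:
    tokens = lambda s: {w.strip(".,!?") for w in s.lower().split()}
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q)

docs = [  # imagine these came back from top-K vector search
    "Our subscription plans offer flexibility.",
    "To cancel your subscription, open Billing and choose Cancel.",
    "Pricing tiers include Starter, Professional, and Enterprise.",
]
print(rerank("how do I cancel my subscription", docs, overlap_score, keep=1))
```

Because `rerank` only depends on the callback's signature, swapping the toy scorer for a real model is a one-line change.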
Stage 2: Add query routing (2-3 days). Classify each query as simple, multi-hop, or no-retrieval. Route accordingly.
```typescript
// Stage 2: Query router — cheapest path to agentic gains
async function hybridRag(query: string): Promise<string> {
  // One cheap LLM call to classify the query type
  const classification = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify this query into one category:
- "simple": single fact, one document
- "multi_hop": needs 2+ documents or comparison
- "no_retrieval": answerable from general knowledge
Respond with just the category name.`,
      },
      { role: "user", content: query },
    ],
    temperature: 0,
    max_tokens: 20,
  });

  const queryType = classification.choices[0].message.content?.trim() ?? "simple";

  // Only complex queries pay the agentic cost
  switch (queryType) {
    case "no_retrieval": return directLlmAnswer(query);
    case "multi_hop": return (await agenticRag(query)).answer;
    default: return naiveRag(query);
  }
}
```

Stage 3: Add self-correction (1-2 days). Add the evaluate_retrieval tool to your agentic loop. Now the agent catches its own gaps. This is the full pattern from this article.
At each stage, run your evaluation set. If Stage 1 gets you above 85% on multi-hop queries, you may not need Stage 3. Let the numbers decide.
Where this is heading
The A-RAG paper revealed something important: agentic retrieval performance scales with model capability. When they upgraded from GPT-4o-mini to GPT-5-mini, accuracy on 2WikiMultiHopQA jumped from 60.2% to 89.7%. The architecture didn't change. The agent just made better retrieval decisions.
This means the pipeline you build today gets better as models improve, without architectural changes. The tools stay the same. The loop stays the same. The decisions get sharper.
Back to our refund policy agent: it started by confidently delivering half an answer. Now it decomposes the question, retrieves all three policies, evaluates its own completeness, and delivers the full comparison. Same documents. Same LLM. The only difference is who controls the retrieval decisions.
That's the trajectory. RAG started as a pipeline. It's becoming a team.
Ready to build? Start with the naive RAG baseline from the RAG-from-scratch tutorial. When your evaluation shows multi-hop accuracy below 70%, come back here and add the agentic layer. If you're building agents that need retrieval alongside tools and memory, the combination of agentic RAG with MCP tool execution is where production systems are heading.
Build agents with knowledge, memory, and tools
Chanl connects your AI agents to knowledge bases, persistent memory, and external tools through MCP. Build the retrieval layer once, monitor it in production, and improve it with real conversation data.