A team ships a RAG chatbot. It handles simple questions fine. Customers ask about the return policy, the bot finds the right document, generates a clean answer. Ship it.
Then someone asks: "How does your current refund policy compare to last year's?"
The bot retrieves the current refund policy. That's what vector search found. It generates a confident answer about the current policy and says nothing about last year's. The customer thinks there's no difference. There is.
That's not a hallucination. That's naive RAG working exactly as designed. It embedded the query, grabbed the top-K similar chunks, and generated from what came back. It has no concept of "this question needs two documents from two time periods." No feedback loop. No self-correction. No planning.
Most teams are still running that same pipeline from 2023. Meanwhile, the research community has branched RAG into five distinct architectures, each designed to catch a different class of failure. Some teach the model to question its own retrieval. Others route queries down entirely different paths based on complexity. One throws out the pipeline entirely and hands retrieval to an agent.
Here's how each one works, when it helps, and when it's overkill.
## Why Does Naive RAG Keep Failing?
Naive RAG structurally cannot answer comparison, multi-hop, or freshness-dependent questions because it has no feedback loop. It embeds the query, retrieves top-K chunks, and generates. It can't reason across documents, can't detect when retrieval failed, and can't skip retrieval when it's unnecessary. That refund policy question wasn't unusual. Three failure modes keep showing up.
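For reference, the whole naive pipeline fits in a few lines — which is exactly why it has nowhere to put a feedback loop. A minimal sketch; `embed`, `index`, and `llm` are stand-in callables, not any particular library's API:

```python
def naive_rag(query, embed, index, llm, k=5):
    """The fixed 2023-era pipeline: embed, retrieve top-k, generate.
    One query, one shot. No routing, no evaluation, no second pass --
    every failure mode below follows from this shape."""
    q_vec = embed(query)
    chunks = index.search(q_vec, k)   # similarity only; can't ask for "both policies"
    prompt = "\n".join(chunks) + "\n\nQ: " + query
    return llm(prompt)
```

Note that nothing downstream ever sees whether `index.search` returned the right documents — the generator just consumes whatever came back.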
Multi-hop reasoning. "Compare policy A to policy B" requires retrieving both documents, recognizing they need to be compared, and synthesizing across them. Vector search retrieves by similarity to the query. If the query mentions policy A, that's what you get. Policy B might have a cosine similarity of 0.6 and fall below the threshold.
Stale or conflicting sources. When your knowledge base has three versions of the same document, naive RAG has no mechanism to prefer the current one. It retrieves whatever is most similar to the query embedding, regardless of freshness or authority. The model then generates from potentially outdated context without knowing it.
Queries that need no retrieval at all. "What time zone is London in?" doesn't need a document lookup. But naive RAG retrieves anyway, potentially injecting irrelevant context that confuses the model. Every query pays the retrieval tax whether it benefits from it or not.
These aren't edge cases. In production knowledge base deployments, multi-hop and comparison questions account for 20-35% of user queries. If your RAG pipeline can't handle them, a third of your answers are degraded. And as we covered in our RAG quality analysis, upgrading the model won't fix retrieval problems.
## What Are the Five RAG Architectures That Replace the Basic Pipeline?
Five research teams tackled naive RAG's failures from different angles: Self-RAG, Corrective RAG, Adaptive RAG, GraphRAG, and Agentic RAG. Each solves a specific failure mode; they aren't interchangeable, and choosing the right one can lift accuracy by 20-35 percentage points on the queries that matter.
So how do you handle the refund policy comparison? The multi-hop question, the stale document, the query that didn't need retrieval at all?
### Self-RAG: The Model That Questions Itself
Self-RAG trains the language model to evaluate its own retrieval and generation at every step, using four special reflection tokens to decide whether to retrieve, whether retrieved content is relevant, whether the generation is supported, and whether the response is useful.
Akari Asai and colleagues at the University of Washington introduced Self-RAG in October 2023. The core insight: instead of always retrieving (standard RAG) or never retrieving (standard LLM), let the model decide. The four reflection tokens work like internal quality gates:
| Token | Question It Answers | Possible Values |
|---|---|---|
| Retrieve | Do I need to look something up? | Yes, No, Continue |
| ISREL | Is this retrieved passage relevant? | Relevant, Irrelevant |
| ISSUP | Does my answer actually use this evidence? | Fully Supported, Partially, No Support |
| ISUSE | Is this response useful overall? | Score 1-5 |
The model generates these tokens during inference, not as a separate pipeline step. If ISREL fires "Irrelevant," the model skips that passage. If ISSUP returns "No Support," it re-retrieves. A 7B-parameter Self-RAG model outperformed ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact verification tasks.
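The control flow those tokens induce can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a real Self-RAG model emits reflection tokens inline during decoding, whereas here `reflect` and `retriever` are stand-in callables:

```python
from dataclasses import dataclass

@dataclass
class Reflection:
    """Stand-in for the four reflection tokens a fine-tuned
    Self-RAG model would emit inline during generation."""
    retrieve: str   # "Yes" | "No" | "Continue"
    isrel: str      # "Relevant" | "Irrelevant"
    issup: str      # "Fully Supported" | "Partially" | "No Support"
    isuse: int      # 1-5

def self_rag_step(query, passage, reflect, retriever=None, max_retries=2):
    """Apply the reflection-token quality gates to one step."""
    for _ in range(max_retries + 1):
        r = reflect(query, passage)
        if r.retrieve == "No":
            return "generate_without_retrieval"
        if r.isrel == "Irrelevant":
            passage = retriever(query)   # skip this passage, fetch another
            continue
        if r.issup == "No Support":
            passage = retriever(query)   # answer unsupported: re-retrieve
            continue
        return "generate_from_passage" if r.isuse >= 3 else "regenerate"
    return "fallback_answer"             # gates never passed within budget
```

The point is that the gates live inside the generation loop, not in a separate pipeline stage bolted on afterward.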
When to use it. High-stakes QA where factual grounding matters more than speed. Medical, legal, financial domains where generating unsupported claims carries real risk. The tradeoff: you need to fine-tune the model to produce reflection tokens, which means training infrastructure.
### Corrective RAG: Three Paths When Retrieval Fails
Corrective RAG inserts a lightweight evaluator between retrieval and generation that scores results as Correct, Ambiguous, or Incorrect, then routes each through a different corrective path. On PubHealth, CRAG improved accuracy by 36.6% over standard RAG.
Shi-Qi Yan and colleagues published CRAG in January 2024. The architecture solves the specific problem of naive RAG blindly generating from whatever retrieval returns, even when retrieval returned garbage. A T5-based evaluator scores retrieval confidence, then three things can happen:
The Correct path refines retrieved documents using a decompose-then-recompose technique, stripping irrelevant sentences before generation. The Ambiguous path does the same refinement but supplements with web search results. The Incorrect path discards retrieved documents entirely and generates from web search alone.
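The three paths reduce to a routing function on the evaluator's confidence score. A minimal sketch — the 0.7/0.3 thresholds and the `refine` and `web_search` helpers are illustrative assumptions, not from the paper, whose evaluator is a fine-tuned T5 and whose refinement is learned:

```python
def refine(docs):
    """Toy stand-in for CRAG's decompose-then-recompose refinement;
    here it just drops empty strings instead of irrelevant sentences."""
    return [d for d in docs if d.strip()]

def crag_route(docs, evaluator, web_search, upper=0.7, lower=0.3):
    """Route by retrieval confidence, following CRAG's three paths."""
    score = evaluator(docs)
    if score >= upper:                 # Correct: refine retrieved docs
        return {"path": "correct", "context": refine(docs)}
    if score <= lower:                 # Incorrect: discard, web search only
        return {"path": "incorrect", "context": web_search()}
    # Ambiguous: refined docs supplemented with web results
    return {"path": "ambiguous", "context": refine(docs) + web_search()}
```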
This is surgical. Instead of one fixed pipeline for every query, CRAG adapts based on retrieval quality. On PopQA, it improved accuracy by 19% over standard RAG. When built on top of Self-RAG, it added another 6.9% accuracy on PopQA and 5% FactScore on biography generation.
When to use it. Customer-facing agents where a wrong answer is worse than a slow answer. Support bots, policy assistants, anything where "I don't know" beats a confident wrong answer. The evaluator adds latency (one extra model call) but catches retrieval failures before they become generation failures.
### Adaptive RAG: Not Every Question Deserves a Search
Here's something obvious that the first three architectures ignore: not every question needs retrieval at all.
"What's 15% of $200?" Your RAG pipeline dutifully embeds this, searches your knowledge base, pulls back three irrelevant chunks about pricing tiers, and then the model ignores them and does the math. You paid for retrieval that added nothing.
Adaptive RAG (Jeong et al., NAACL 2024) puts a small classifier in front of the pipeline. Before any retrieval happens, it predicts query complexity and routes accordingly:
| Query Complexity | Route | Example |
|---|---|---|
| Simple | Direct LLM (no retrieval) | "What year was Python released?" |
| Medium | Single-step retrieval | "What's our refund policy for enterprise?" |
| Complex | Iterative multi-step retrieval | "Compare our Q1 and Q2 pricing changes" |
This solves the retrieval tax problem. In a mixed workload, 30-50% of queries are simple enough that retrieval adds latency without improving accuracy. Adaptive RAG skips retrieval for those queries, reserves single-step retrieval for straightforward lookups, and deploys the expensive iterative pipeline only for genuinely complex questions.
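The routing itself is a small dispatcher. A sketch under the assumption that `classify`, `llm`, and the two retrievers are supplied callables; in the paper, the classifier is a small trained model, not a rule:

```python
def route_query(query, classify, llm, retrieve_once, retrieve_iterative):
    """Dispatch on predicted complexity before any retrieval happens."""
    label = classify(query)
    if label == "simple":
        return llm(query)                                      # no retrieval at all
    if label == "medium":
        return llm(query, context=retrieve_once(query))        # single-step
    return llm(query, context=retrieve_iterative(query))       # iterative multi-step
```

The savings come from the first branch: simple queries never touch the index, so they skip both the latency and the risk of irrelevant context.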
When to use it. Mixed workloads where query complexity varies widely. Internal knowledge assistants where employees ask everything from "what's the WiFi password" to "how did our compliance requirements change between 2024 and 2025." The classifier overhead is minimal (one small model inference), and the latency savings on simple queries are significant.
### GraphRAG: When Documents Have Relationships
GraphRAG builds a knowledge graph from your documents and retrieves connected subgraphs instead of isolated chunks. It handles relational questions ("how does X affect Y") that vector search structurally cannot answer, using 97% fewer tokens than raw document summarization at the highest abstraction level.
Microsoft Research published GraphRAG in April 2024 (Edge et al.). The architecture works in two stages: first, an LLM extracts entities and relationships from source documents to build a knowledge graph. Then it generates community summaries at multiple hierarchy levels. At query time, it retrieves relevant subgraphs, not just similar text chunks.
This matters for a specific class of questions. "How does the engineering team's on-call policy relate to the SLA commitments in our enterprise contracts?" requires understanding relationships between documents, not just finding similar ones. Vector search can find the on-call policy. It can find the SLA document. It can't connect them.
GraphRAG handles two retrieval modes. Local search answers questions about specific entities by retrieving the entity's neighborhood in the graph. Global search answers broad sensemaking questions by aggregating community summaries across the entire corpus.
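Local search over a toy graph shows the shape of the idea. The adjacency-dict representation and the `hops` parameter are assumptions for illustration; real GraphRAG builds its graph with LLM entity extraction and layers community summaries on top:

```python
from collections import deque

def local_search(graph, entity, hops=1):
    """Retrieve an entity's k-hop neighborhood as (subject, relation, object)
    triples -- GraphRAG's local mode. `graph` maps entity -> {neighbor: relation}."""
    seen = {entity}
    frontier = deque([(entity, 0)])
    triples = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue                      # stop expanding past the hop budget
        for nbr, rel in graph.get(node, {}).items():
            triples.append((node, rel, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return triples
```

With `hops=2`, the on-call policy example reaches the SLA document *through* the relationship — the connection vector search can't make.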
The cost concern is real. Building the initial knowledge graph requires many LLM calls. Microsoft's LazyGraphRAG variant (late 2024) addresses this by deferring all LLM calls to query time instead of indexing time. And LinearRAG, accepted at ICLR 2026, eliminates LLM token costs during graph construction entirely using a relation-free approach.
When to use it. Enterprise knowledge bases with structured relationships: policy hierarchies, org charts, product catalogs with dependencies, regulatory frameworks. If your users ask "compare" or "how does X relate to Y" questions frequently, that's a signal. If queries are mostly simple lookups, the graph construction cost isn't justified.
### Agentic RAG: The Agent IS the Pipeline
Every architecture so far still has a pipeline. Agentic RAG throws it out.
Instead of a fixed sequence, an agent loop takes over: plan what to retrieve, execute, evaluate the results, re-query if they're insufficient, synthesize when they're good enough. Retrieval becomes a tool the agent calls, not a stage it passes through. We covered this shift in depth in our agentic RAG deep dive.
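Stripped to its skeleton, the loop looks like this. All five callables are hypothetical stand-ins for model and tool calls; the structure, not the components, is the point:

```python
def agentic_rag(question, plan, tools, evaluate, synthesize, max_steps=5):
    """Minimal agent loop: plan -> call a tool -> evaluate -> repeat.
    Retrieval is just one of the tools the planner can pick."""
    evidence = []
    for _ in range(max_steps):
        action = plan(question, evidence)     # e.g. ("search", "refund policy 2024")
        if action is None:                    # planner decides it has enough
            break
        tool_name, arg = action
        evidence.extend(tools[tool_name](arg))
        if evaluate(question, evidence):      # sufficient to answer? stop early
            break
    return synthesize(question, evidence)
```

On the opening refund-policy question, a planner would issue two retrievals (current policy, then last year's) before synthesizing — exactly what the fixed pipeline cannot do.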
The A-RAG framework (February 2026) formalized this with three retrieval interfaces at different granularities: keyword search for casting a wide net, semantic search for narrowing within candidates, and chunk read for drilling into exact passages. The agent picks the right tool for each step. On HotpotQA, it scored 94.5%, and the approach scales with model capability: as models improve, the retrieval gets better automatically.
RAG-EVO (EPIA 2025) pushed further with evolutionary learning. The system maintains persistent vector memory and improves its retrieval strategies over time, achieving 92.6% composite accuracy and outperforming Self-RAG, HyDE, and ReAct in head-to-head comparisons.
When to use it. Complex research tasks requiring multi-source synthesis. Anything where a single retrieval pass isn't enough: competitive analysis, due diligence, incident investigation. If your agent uses tools and MCP connections alongside retrieval, you're already partway to agentic RAG. The tradeoff is cost (2-5x per query) and latency (2-6 seconds versus sub-second for naive RAG).
## How Do You Choose the Right RAG Architecture?
You don't pick a RAG architecture based on which paper was most impressive. You pick based on how your current system fails. Match the failure mode to the architecture, and the decision becomes straightforward.
| Architecture | Best For | Latency Impact | Accuracy Gain | Implementation Complexity | Upgrade When... |
|---|---|---|---|---|---|
| Naive RAG | Simple single-doc lookups | Baseline (0.5-2s) | Baseline | Low | You're just starting |
| Self-RAG | High-stakes factual QA | +0.5-1s (reflection) | +10-15% on fact verification | High (requires fine-tuning) | Wrong answers carry legal/medical risk |
| Corrective RAG | Customer-facing agents | +0.5-1s (evaluator) | +19-37% on noisy corpora | Medium (add evaluator) | Users report confident wrong answers |
| Adaptive RAG | Mixed-complexity workloads | -30% on simple queries | +5-10% overall | Medium (train classifier) | 30%+ of queries don't need retrieval |
| GraphRAG | Relational/comparative queries | +1-3s (graph traversal) | +20%+ on multi-hop | High (graph construction) | Users ask "compare X to Y" questions |
| Agentic RAG | Multi-source research tasks | +2-5s (agent loop) | +15-30% on complex QA | High (agent framework) | Single retrieval pass consistently fails |
A practical decision tree:
1. Are your answers wrong because of bad retrieval quality? Fix chunking and re-ranking first. Architecture changes won't help if you're feeding garbage to the generator.
2. Are your answers wrong because the model generates unsupported claims? Self-RAG or Corrective RAG. Both add verification layers.
3. Are your queries highly variable in complexity? Adaptive RAG. Stop paying retrieval costs on simple questions.
4. Do users ask relational or comparative questions? GraphRAG. Vector search can't connect documents.
5. Do queries require iterating over multiple sources? Agentic RAG. Let the agent decide what to retrieve and when to stop.
Most teams should start at step 1. The biggest accuracy gains in production RAG come from fixing embeddings and retrieval, not from switching architectures.
## Is RAG Dead Now That Models Accept 1M+ Tokens?
Long-context models haven't killed RAG. They've changed when you need it. Gemini 2.5 Pro handles 2 million tokens natively, and Claude supports 200K (1M with extended context). The question is obvious: why retrieve at all when you can stuff everything into context?
The "RAG is dead" discourse peaked in early 2025 and has since settled into a more nuanced position. Enterprise RAG adoption accelerated through 2025, with the global RAG market projected to grow from $2 billion to over $40 billion by 2035. The discourse and the deployment data tell opposite stories.
Three reasons RAG persists:
Cost. Sending 200K tokens per query adds up fast. Even mid-tier models charge $1-3 per million input tokens, putting a single 200K-token query at $0.20-0.60 for input alone. At 10,000 queries per day, that's $2,000-6,000 daily. RAG with focused retrieval costs 10-20x less per query. For most production workloads, the economics aren't close.
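The arithmetic is easy to check. A quick sketch using the article's illustrative prices; the 10K-token retrieval budget is an assumption chosen to show the 20x end of the range:

```python
def daily_input_cost(tokens_per_query, queries_per_day, usd_per_million_tokens):
    """Daily input-token spend for a given per-query context size."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000

# Stuffing 200K tokens per query at $1/M input tokens, 10,000 queries/day:
full_context = daily_input_cost(200_000, 10_000, 1.0)   # $2,000/day
# Focused RAG retrieval of ~10K tokens per query instead:
focused = daily_input_cost(10_000, 10_000, 1.0)         # $100/day
```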
Freshness. Long-context models work on static data you load into the context window. Production knowledge bases change daily. RAG retrieves from the current state of your data. You don't need to reload a million-token context every time a document updates.
Precision. Counterintuitively, models perform worse at finding specific facts buried in very long contexts. The "needle in a haystack" problem scales with context length. RAG with good retrieval is more precise because it surfaces only the relevant chunks.
The real convergence is context engineering: using RAG to identify the right 5,000 tokens, then using long-context capabilities to reason across them. That's not RAG dying. That's RAG becoming one component in a larger orchestration system. The architectures in this article, particularly Adaptive RAG and Agentic RAG, already point in this direction.
## Before You Rearchitect Anything
Fix chunking, add a re-ranker, and build an evaluation set before changing architectures. Most RAG quality problems trace back to bad retrieval, not the wrong pipeline design. If you're running naive RAG today, resist the urge to jump straight to Agentic RAG.
- Fix your chunking. Semantic chunking at paragraph or section boundaries beats fixed-size token splits for nearly every document type. Our RAG from scratch guide walks through the implementation.
- Add a re-ranker. A cross-encoder re-ranker between retrieval and generation catches the most common failure: semantically similar but factually irrelevant chunks making it into context. This alone typically improves answer quality more than any architecture swap.
- Measure before upgrading. Build an evaluation set. Run your current pipeline against it. Identify which failure mode dominates: unsupported generation (Self-RAG), bad retrieval (Corrective RAG), wasted retrieval (Adaptive RAG), missing relationships (GraphRAG), or insufficient iteration (Agentic RAG). Then upgrade to the architecture that matches.
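The re-ranking step mentioned above reduces to scoring (query, chunk) pairs jointly and keeping the top few. A minimal sketch with a stand-in `score` function; in practice you'd plug in a cross-encoder (for example, a sentence-transformers `CrossEncoder`) that scores each pair with full attention over both texts rather than by embedding similarity:

```python
def rerank(query, chunks, score, top_n=3):
    """Re-rank retrieved chunks by joint (query, chunk) relevance
    before they reach the generator's context window."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]
```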
Remember that refund policy question from the opening? With Corrective RAG, the evaluator would flag the single-document retrieval as incomplete. With GraphRAG, the knowledge graph would connect both policy versions. With Agentic RAG, the agent would retrieve the current policy, recognize the question asks for a comparison, retrieve the previous version, and synthesize across both.
The fix isn't smarter generation. It's retrieval that knows when it hasn't found enough. Monitor what fails through scorecards and analytics, and upgrade only when the data tells you which failure mode to fix.