A team I spoke with last month was running GPT-4o for their customer support agent. Their knowledge base had 400 documents. The agent kept giving wrong answers about their refund policy.
Their fix? Upgrade to Claude Opus. Triple the per-token cost. The wrong answers got more articulate. They didn't get more correct.
This happens constantly. A RAG system returns bad answers, and the first instinct is to blame the model. Swap it for a bigger one. Fine-tune it. Throw more parameters at the problem. It almost never works, because the model was never the problem.
The Model Isn't Your Problem
Generation is the last step in the RAG pipeline, and it's downstream of everything else. The model receives chunks of text and synthesizes an answer from them. That's it. If the chunks don't contain the right information, the model has two options: say "I don't know" or fill the gap with something plausible. Most models choose the second option, because that's what they're optimized to do.
This is the part teams miss. When your RAG agent confidently states the wrong refund policy, it's not because the model lacks capability. It's because the retrieval pipeline handed it a chunk from a 2024 policy document instead of the current one. Or it handed the model a chunk that was cut off mid-sentence, missing the critical exception clause.
Upgrading from GPT-4o to Opus or Gemini won't fix that. You'll just get wrong answers delivered with better grammar.
Enterprise deployments bear this out. Reported figures put vanilla LLM chatbots at 40-60% factual error rates, with RAG cutting that to under 10%, and nearly all of the remaining errors trace back to retrieval failures, not generation failures. The model works fine when it gets the right context. The pipeline upstream is what breaks.
The Real Culprits
Three retrieval problems cause the vast majority of RAG quality failures. None of them are solved by a better model.
Bad chunking destroys context before it reaches the model
Fixed-size chunking is the default in most RAG tutorials and frameworks. Set a token limit (usually 500), split the document, add some overlap, embed, done.
The problem is that documents aren't written in 500-token units. A refund policy might have a main rule in one paragraph and a critical exception two paragraphs later. Fixed chunking splits them into separate chunks. The model retrieves the rule without the exception and delivers a confidently incomplete answer.
This is worse than a wrong answer. It's a partially correct answer that sounds authoritative, so nobody questions it until a customer gets burned.
Recursive chunking, which splits first by sections, then paragraphs, then sentences, preserves more structure. Semantic chunking goes further by detecting topic boundaries using embedding similarity, keeping conceptually unified text together. Recent benchmarks show recursive splitting at 512 tokens with overlap remains a strong default, but the right strategy depends entirely on your content. Legal docs need section-aware splitting. FAQ pages need question-answer pair preservation. One-size-fits-all chunking is the single biggest quality destroyer in most RAG systems.
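To make the contrast concrete, here is a minimal sketch of recursive splitting in plain Python. It is not any particular framework's implementation; the token count is approximated by whitespace-delimited words, and the separator list is an assumption you would tune for your content.

```python
def recursive_split(text, max_tokens=512, separators=("\n\n", "\n", ". ")):
    """Split at the largest natural boundary, recursing to finer ones as needed."""
    words = text.split()
    if len(words) <= max_tokens:          # crude token estimate: whitespace words
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) < 2:
            continue                      # this separator doesn't divide the text
        chunks, buf = [], ""
        for part in parts:
            candidate = buf + sep + part if buf else part
            if len(candidate.split()) <= max_tokens:
                buf = candidate           # greedily pack parts under the budget
            else:
                if buf:
                    chunks.append(buf)
                if len(part.split()) <= max_tokens:
                    buf = part
                else:                     # a single part is still too big: recurse
                    chunks.extend(recursive_split(part, max_tokens, separators))
                    buf = ""
        if buf:
            chunks.append(buf)
        return chunks
    # No separator helped: hard-split by word count as a last resort.
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

The point of the structure: a paragraph boundary is tried before a line break, and a line break before a sentence, so the rule and its exception stay together whenever they fit in one budget.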
Wrong embedding model misses your domain
Your embedding model converts text into vectors. If the model doesn't understand your domain's vocabulary, semantically similar content won't land near each other in vector space.
A general-purpose embedding model trained on web text will treat "ARM processor" and "arm injury" as related. It'll put "401(k) rollover" far from "retirement account transfer" because the surface text looks different. (Our Learning AI series covers how embeddings and vector similarity actually work if you want the technical foundations.) Domain-specific content requires either a domain-tuned embedding model or, at minimum, one trained on diverse enough data to handle your terminology.
This is especially brutal for industries with heavy jargon: legal, medical, financial services, manufacturing. The embedding model is the lens through which your entire knowledge base is perceived. A blurry lens means blurry retrieval, no matter how sharp the model on the other end.
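The "blurry lens" effect is easy to state in math: retrieval ranks by cosine similarity between vectors, so if your model places two phrasings of the same concept far apart, no downstream component can recover them. The vectors below are invented three-dimensional toys purely for illustration (real embeddings have hundreds or thousands of dimensions); only the cosine computation itself is real.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors for illustration only, not output of any real model.
general = {  # general-purpose model: surface-text mismatch keeps them apart
    "401(k) rollover":             [0.9, 0.1, 0.0],
    "retirement account transfer": [0.1, 0.9, 0.2],
}
domain = {   # domain-tuned model: the concepts cluster together
    "401(k) rollover":             [0.8, 0.5, 0.1],
    "retirement account transfer": [0.7, 0.6, 0.1],
}
```

A query phrased one way only retrieves documents phrased the other way when those cosines are high, which is exactly what domain tuning buys you.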
Pure vector search fails on exact terms
Vector search is excellent at "vibes." Query about customer complaints and it'll find documents about user feedback, support tickets, and satisfaction surveys. That semantic flexibility is the whole point.
But users don't always search by vibes. They search for "error code E-4012." They search for "Model X Pro 2026 warranty." They search for "Section 7.3.2 of the service agreement."
Pure vector search handles these terribly. The embedding for "E-4012" doesn't reliably land near the document that mentions error code E-4012, because there's no semantic relationship to exploit. It's a literal string match problem being solved by a semantic similarity tool.
This is why production RAG systems are moving to hybrid search: vector similarity for meaning, BM25 keyword matching for exact terms. The numbers are hard to argue with. Studies show BM25 alone retrieves relevant documents 62% of the time. Vector search alone hits 71%. Hybrid search with re-ranking reaches 87%.
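One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions, not the raw scores. This is a generic sketch of that fusion step, with made-up document IDs; the constant k=60 is the value commonly used in the RRF literature, not a tuned choice.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists (best first): each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "error code E-4012 support tickets":
vector_hits  = ["doc_feedback", "doc_surveys", "doc_tickets"]  # semantic matches
keyword_hits = ["doc_e4012", "doc_tickets", "doc_manual"]      # BM25 exact terms
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

A document that appears in both lists, even mid-ranked, beats a document that tops only one, which is the behavior you want: agreement between the semantic and lexical views is strong evidence of relevance.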
The Fixes That Actually Work
Each failure point in the retrieval pipeline has a well-understood fix. None of them require a bigger model.
| Problem | Default Approach | Better Approach | Improvement |
|---|---|---|---|
| Chunking | Fixed 500-token splits | Recursive or semantic chunking at natural boundaries | Preserves context, reduces partial-answer hallucinations |
| Embeddings | General-purpose model (e.g., text-embedding-ada-002) | Domain-tuned or multilingual model matched to content | Better clustering of domain-specific concepts |
| Search | Pure vector similarity (top-K) | Hybrid search: vector + BM25 keyword matching | 20-30% retrieval accuracy improvement |
| Ranking | Return top-K results as-is | Two-stage: retrieve top-20, re-rank with cross-encoder, keep top-3 | Up to 28% NDCG improvement, catches misranked relevant docs |
| Freshness | Re-index manually when someone remembers | Version-controlled docs with freshness weighting | Prevents stale content from outranking current policy |
Re-ranking deserves special attention because it's the highest-impact fix most teams haven't tried. A cross-encoder re-ranker (like Cohere Rerank or an open-source BGE model) examines each query-document pair together, not independently. This catches cases where a document is highly relevant but didn't score well in initial retrieval because the query phrasing didn't match. Adding re-ranking to an existing pipeline typically improves accuracy 20-35% with only 200-500ms of additional latency.
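Structurally, the second stage is simple: take a wide candidate set, score each (query, document) pair jointly, keep the top few. The sketch below shows that shape; the token-overlap scorer is a deliberately naive stand-in so the example runs anywhere, and a real pipeline would replace it with a cross-encoder call.

```python
def rerank(query, candidates, score_fn, keep=3):
    """Second stage: score each (query, doc) pair jointly, keep the best few."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:keep]

# Stand-in pairwise scorer (token overlap), for illustration only. In production
# this is where a cross-encoder (Cohere Rerank, a BGE reranker, etc.) plugs in.
def toy_pair_score(query, doc):
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

candidates = [  # imagine these are the top hits from first-stage retrieval
    "Shipping times vary by region and carrier.",
    "Our refund policy covers returns within 30 days, with one exception.",
    "Contact support to escalate a ticket.",
]
top = rerank("refund policy exception", candidates, toy_pair_score, keep=1)
```

The key property is that `score_fn` sees query and document together, so a relevant document whose phrasing diverged from the query in embedding space can still win the second round.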
The catch is that these fixes are cumulative. Hybrid search alone helps. Hybrid search plus re-ranking helps more. But if your chunks are shredding context at the source, better search and ranking just surface broken chunks more efficiently.
Fix chunking first. Then search. Then ranking. That's the order. (Chanl's knowledge base handles all three layers so you're not building custom retrieval infrastructure from scratch.)
How to Diagnose Before You Spend
Before upgrading your model, spending a weekend on re-ranking infrastructure, or switching vector databases, do this: look at what your retrieval pipeline actually returns.
Pull the last 50 queries where users reported wrong answers. For each one, inspect the chunks that were retrieved. Not the final generated answer. The chunks. Ask two questions:
First, does the retrieved content contain the correct answer? If the right information isn't in the chunks, no model on earth will generate the right answer. This is a retrieval problem, full stop.
Second, is the correct chunk present but ranked too low? If the answer appears in chunk number 8 but you're only passing the top 3 to the model, you have a ranking problem. Re-ranking fixes this.
If the right chunk is retrieved and ranked highly, and the model still generates a wrong answer, then you have a generation problem. Upgrade the model. But in my experience, this is the root cause less than 20% of the time.
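The two-question diagnosis above reduces to a small triage function. This is a sketch of the decision logic, not a tool; `has_answer` stands in for however you judge whether a chunk contains the right information (manual review or an LLM-as-judge check).

```python
def triage_failure(retrieved, has_answer, top_k=3):
    """Classify a wrong answer by inspecting the retrieved chunks, not the output."""
    hits = [rank for rank, chunk in enumerate(retrieved, start=1)
            if has_answer(chunk)]
    if not hits:
        return "retrieval problem"   # right info never came back: fix chunking/search
    if min(hits) > top_k:
        return "ranking problem"     # present but below the cutoff: add a re-ranker
    return "generation problem"      # model had the right context and still failed

# Example: the correct chunk came back, but ranked 4th with a top-3 cutoff.
chunks = ["shipping info", "holiday hours", "old 2024 policy",
          "current refund policy with exception clause"]
verdict = triage_failure(chunks, lambda c: "current refund" in c, top_k=3)
```

Run this over your 50 failed queries and the histogram of verdicts tells you where to spend: mostly "retrieval" means fix chunking or search, mostly "ranking" means add a re-ranker, and only a mostly-"generation" histogram justifies a model upgrade.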
Separate retrieval quality measurement from generation quality measurement. Per-stage instrumentation, whether an analytics dashboard or an automated scorecard, gives you visibility into what's happening at each step of the pipeline rather than just the end result. Without that separation, you're debugging a five-stage pipeline by looking only at the output.
We covered the mechanics of building these retrieval pipelines in RAG from Scratch, including chunking strategies and embedding selection. And if you're running into problems where RAG alone isn't covering your production needs, the knowledge base bottleneck article digs into the broader architecture around freshness, governance, and structured data as a complementary retrieval tier.
Fix Retrieval First, Upgrade Models Second
GPT-4o-mini with great retrieval will beat Claude Opus with bad retrieval. Every time.
The math is simple. If your retrieval pipeline returns the right chunks 90% of the time, even a mid-tier model generates correct answers for most queries. If your pipeline returns the right chunks 50% of the time, Opus will hallucinate for the other 50% just as confidently as 4o-mini would.
The model is the last mile. Retrieval is the road. Build the road first.
Once retrieval is solid, pairing it with tools that verify facts against live data sources and persistent memory that tracks what your agent has learned across conversations closes the remaining gaps. But none of that matters if the chunks feeding your model are wrong.
Stop blaming the model. Start inspecting your chunks.
And if you're also wondering why your agent can call 50 tools but can't remember what a customer said yesterday, that's a different gap entirely. If you're wrestling with tool calling differences across providers, MCP is solving that fragmentation at the protocol level.
RAG That Works Out of the Box
Chanl's knowledge base handles chunking, hybrid search, and retrieval quality so you don't have to build the pipeline yourself.