Ask an LLM about your company's return policy and it'll confidently make one up. The model doesn't know your docs exist — it's generating from training data, not your data.
Retrieval-Augmented Generation (RAG) fixes this. Instead of hoping the model memorized the right information during training, you fetch the relevant documents first and hand them to the model as context. The model generates an answer grounded in your actual data. No fine-tuning, no retraining, no waiting weeks for a model update.
Here you'll build a complete RAG pipeline from scratch in TypeScript: chunking, embeddings, vector search, and generation. No framework abstractions hiding the moving parts — just the raw components wired together so you understand every piece.
| Pipeline Stage | What it does | Key decision |
|---|---|---|
| Chunking | Split documents into searchable pieces | Recursive splitting at 300–500 tokens (best default) |
| Embedding | Convert text chunks into vector representations | text-embedding-3-small for cost; 3-large for quality |
| Vector store | Store and search embeddings by similarity | In-memory for prototyping; Pinecone/pgvector for production |
| Retrieval | Find top-K chunks closest to the user's query | Top 3–5 chunks balances precision and context coverage |
| Generation | LLM answers using only the retrieved context | Constrain the prompt to prevent hallucination beyond context |
| Evaluation | Score relevance, faithfulness, and answer quality | LLM-as-judge with structured rubric |
What RAG actually does
Retrieval-Augmented Generation works in three stages: index your documents as vector embeddings, retrieve the most relevant chunks for a given query using similarity search, then generate an answer by feeding those chunks as context to an LLM. Everything else is optimization on top of these three.
1. Indexing — Take your documents, split them into chunks, convert each chunk into a vector embedding, and store those vectors somewhere searchable.
2. Retrieval — When a user asks a question, convert that question into a vector embedding too, then find the document chunks whose vectors are closest to the question vector.
3. Generation — Take the retrieved chunks, stuff them into a prompt alongside the user's question, and send the whole thing to an LLM. The model generates an answer grounded in those specific documents.
That's the whole pattern. The reason RAG works so well is that it separates knowing where to look (retrieval) from knowing how to answer (generation). The retrieval system handles relevance. The LLM handles synthesis and language. Each does what it's best at.
Why not just fine-tune?
Fine-tuning bakes knowledge directly into the model's weights. That sounds appealing until you consider the tradeoffs:
- Staleness. Fine-tuned knowledge is frozen at training time. When your docs change, you retrain. With RAG, you just re-index the changed documents.
- Cost. Fine-tuning GPT-4o costs ~$25 per million training tokens, takes hours, and you pay again every time your data changes. RAG embedding costs pennies per million tokens and takes seconds.
- Traceability. With RAG, you can show exactly which documents produced an answer. With fine-tuning, the model's reasoning is opaque — you can't point to a source.
- Hallucination control. Fine-tuned models still hallucinate. RAG gives you a concrete mechanism to constrain answers to retrieved context.
Fine-tuning is useful for teaching a model a new style or behavior (e.g., always respond in a specific format). RAG is for giving a model access to specific knowledge. Most teams need RAG, not fine-tuning. Some need both.
What about long context windows?
GPT-4o supports 128K tokens of context. Claude supports 200K. Can't you just dump all your docs into the prompt and skip retrieval entirely?
You can, and for small document sets it works. But there are three problems:
- Cost scales linearly. Every query pays for the full context window. Sending 100K tokens per query at $2.50/1M input tokens means $0.25 per question. RAG sends only the relevant 1-2K tokens, cutting cost by 50-100x.
- Latency increases. More input tokens = slower responses. A 100K token prompt takes noticeably longer than a 2K token prompt with three targeted chunks.
- Accuracy degrades. Research consistently shows that LLMs perform worse at finding relevant information in the middle of very long contexts. RAG pre-filters to just the relevant chunks, so the model doesn't have to search.
The practical rule: if your entire knowledge base fits in 10-20K tokens and doesn't change often, just stuff it in the prompt. Beyond that, RAG is more cost-effective, faster, and more accurate.
There's also a hybrid approach: use RAG to retrieve relevant chunks, but include a broader "context summary" in every prompt (a 500-token overview of your product or domain). This gives the model general awareness while RAG provides specific details. Think of it as the model knowing the table of contents of your knowledge base, while RAG retrieves the specific pages.
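A minimal sketch of that hybrid prompt construction — the `buildHybridPrompt` helper and the domain summary text are invented for illustration:

```typescript
// Hypothetical sketch: every prompt carries a fixed domain summary,
// while RAG supplies the query-specific chunks.
const DOMAIN_SUMMARY =
  "Acme Support: we sell project-management software with Lite, Pro, and Enterprise plans.";

function buildHybridPrompt(retrievedChunks: string[], question: string): string {
  const context = retrievedChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n");
  return [
    `Background (always included):\n${DOMAIN_SUMMARY}`,
    `Retrieved context:\n${context}`,
    `Question: ${question}`,
  ].join("\n\n");
}
```

The summary plays the role of the table of contents; the retrieved chunks are the specific pages.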
RAG for production AI agents
For AI agents in production, RAG is what turns a generic chatbot into something that actually knows your business. An agent with RAG can reference your knowledge base, pull up specific policy documents, and give answers rooted in real information — which is exactly the kind of persistent memory that makes agents useful in the real world.
Without RAG, an agent answering "What's our refund policy for enterprise customers?" has to guess based on its training data. With RAG, it retrieves your actual refund policy document and quotes the relevant section. The difference between those two experiences is the difference between a toy demo and a production tool.
The RAG architecture
A RAG pipeline flows in two directions: documents go through chunking, embedding, and storage at index time, while user queries go through embedding, similarity search against stored vectors, and LLM generation at query time.
Every piece is swappable. You can change the chunking strategy, the embedding model, the vector store, or the LLM independently. That modularity is the whole point — and it's why building from scratch first is so valuable. When you use a framework like LangChain or LlamaIndex, these pieces are hidden behind abstractions. Building them yourself means you understand exactly where to look when something breaks.
Prerequisites
You'll need an OpenAI API key. The embeddings model we're using (text-embedding-3-small) costs $0.02 per million tokens — running this tutorial costs a fraction of a cent. For generation, we'll use gpt-4o-mini.
Building the pipeline
We'll build four modules — chunker, embedder, vector store, and generator — then wire them together in a main file. Each module is independent and has a single responsibility, which makes it easy to swap components later.
The file structure looks like this:
rag-from-scratch/
src/
chunker.ts # Split documents into chunks
embeddings.ts # Convert text to vectors via OpenAI
vector-store.ts # In-memory store with cosine similarity
generator.ts # Prompt construction and LLM generation
rag.ts # Main pipeline: index + query
package.json
Create a new project:
mkdir rag-from-scratch && cd rag-from-scratch
npm init -y
npm install openai
Here's the package.json you'll need:
{
"name": "rag-from-scratch",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "npx tsx src/rag.ts"
},
"dependencies": {
"openai": "^4.73.0"
},
"devDependencies": {
"tsx": "^4.19.0"
}
}
Step 1: Chunking
First, we need to split documents into chunks. Why? Because embedding models have token limits, and smaller chunks produce more precise retrieval. If you embed an entire 50-page document as one vector, the embedding is a blurry average of everything in that document. If you embed individual paragraphs, each vector captures a specific idea — and retrieval can pinpoint exactly the right paragraph.
There are three common chunking strategies:
Fixed-size chunking — Split every N characters with overlap. Simple and predictable, but cuts mid-sentence. Works well for structured data like logs or CSVs where sentence boundaries don't matter much.
Sentence-based chunking — Split on sentence boundaries. Preserves meaning but produces uneven chunk sizes — some chunks end up with a single short sentence, others with a long paragraph.
Recursive chunking — Try splitting on paragraphs first, then sentences, then words. Keeps semantic coherence while respecting size limits. This is what LangChain uses internally, and it's what we'll build.
The recursive approach works by trying the largest separator first (double newline for paragraphs). If a resulting segment is still too large, it falls back to the next separator (single newline, then sentence boundaries, then spaces). This preserves the natural structure of your documents as much as possible.
// Recursive character text splitter
// Tries separators in order: paragraphs → sentences → words → characters
export interface Chunk {
text: string;
index: number;
metadata?: Record<string, unknown>;
}
export function chunkText(
text: string,
options: {
maxChunkSize?: number;
overlap?: number;
separators?: string[];
} = {}
): Chunk[] {
const {
maxChunkSize = 500,
overlap = 50,
separators = ["\n\n", "\n", ". ", " "],
} = options;
const chunks: Chunk[] = [];
function splitRecursive(text: string, separatorIndex: number): string[] {
if (text.length <= maxChunkSize) return [text];
if (separatorIndex >= separators.length) {
// Last resort: hard split
const parts: string[] = [];
for (let i = 0; i < text.length; i += maxChunkSize - overlap) {
parts.push(text.slice(i, i + maxChunkSize));
}
return parts;
}
const separator = separators[separatorIndex];
const parts = text.split(separator);
const merged: string[] = [];
let current = "";
for (const part of parts) {
const candidate = current ? current + separator + part : part;
if (candidate.length > maxChunkSize && current) {
merged.push(current);
current = part;
} else {
current = candidate;
}
}
if (current) merged.push(current);
// If any chunk is still too large, split it with the next separator
const result: string[] = [];
for (const chunk of merged) {
if (chunk.length > maxChunkSize) {
result.push(...splitRecursive(chunk, separatorIndex + 1));
} else {
result.push(chunk);
}
}
return result;
}
const rawChunks = splitRecursive(text, 0);
for (let i = 0; i < rawChunks.length; i++) {
const trimmed = rawChunks[i].trim();
if (trimmed.length > 0) {
chunks.push({ text: trimmed, index: chunks.length });
}
}
return chunks;
}
The metadata field on each chunk is empty here, but it becomes important in production. You'd attach the source document name, section heading, page number, creation date — anything that helps you filter or rank results later. When a user asks about pricing, metadata filters can restrict the search to documents tagged "pricing" before the vector comparison even runs. Here's what production metadata typically looks like:
chunks.push({
text: trimmed,
index: chunks.length,
metadata: {
source: "pricing-faq.md",
section: "Enterprise Plan",
lastUpdated: "2026-02-15",
category: "pricing",
accessLevel: "public",
},
});
This lets you do things like "only search documents updated in the last 6 months" or "only search documents the current user has access to" — critical for production systems.
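A sketch of what such metadata pre-filtering could look like. The `filteredSearch` helper and `DocEntry` type are hypothetical, and the 2-D vectors are toys for illustration — the point is that the filter predicate runs before any similarity math:

```typescript
// Hypothetical sketch: apply a metadata filter, then rank by cosine similarity.
interface DocEntry {
  text: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

function filteredSearch(
  docs: DocEntry[],
  queryEmbedding: number[],
  filter: (m: Record<string, unknown>) => boolean,
  topK = 3
): { text: string; score: number }[] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return docs
    .filter((d) => filter(d.metadata)) // metadata filter runs first
    .map((d) => ({ text: d.text, score: cosine(queryEmbedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

In a real store the filter would run against an index rather than an array scan, but the ordering is the same: narrow by metadata, then compare vectors.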
The overlap parameter also deserves a note. When you split text into chunks, you lose context at the boundaries. A fact that spans two paragraphs might get cut in half. Overlap mitigates this by repeating the last N characters of each chunk at the start of the next one. Fifty characters of overlap is a reasonable default — enough to preserve boundary context without inflating your chunk count too much. Note that the splitter we built only applies overlap in its hard-split fallback; a production splitter would typically apply it at every chunk boundary.
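To see the mechanism concretely, here's the hard-split-with-overlap logic in isolation (the same stepping used in the chunker's last-resort branch), applied to a toy string:

```typescript
// Fixed-size split with overlap: advance by (size - overlap) each step,
// so the last `overlap` characters of one chunk reappear in the next.
function hardSplit(text: string, size: number, overlap: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    parts.push(text.slice(i, i + size));
  }
  return parts;
}

const demo = hardSplit("abcdefghij", 6, 2);
// demo[0] is "abcdef"; demo[1] begins with "ef" — the 2-character overlap
```

If a fact sits on the "ef" boundary, both chunks now contain it, so either can be retrieved.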
Step 2: Embeddings
Now we convert chunks into vectors. An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings have vectors that point in similar directions — "return policy" and "refund guidelines" would produce vectors that are close together, even though they share no exact words.
This is what makes RAG fundamentally different from keyword search. Traditional search requires exact word matches. Embedding-based search understands meaning. A user asking "Can I get my money back?" will match a document about "refund policies" because the embeddings capture semantic similarity, not lexical overlap.
We'll use OpenAI's text-embedding-3-small, which produces 1536-dimensional vectors. Each dimension captures some aspect of the text's meaning — the model learned these dimensions during training on billions of text pairs. You can think of each dimension as a slider on a mixing board, and the full 1536-dimension vector as a unique "fingerprint" of the text's meaning.
import OpenAI from "openai";
const openai = new OpenAI(); // Uses OPENAI_API_KEY env var
export async function embedTexts(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
// Sort by index to maintain order
return response.data
.sort((a, b) => a.index - b.index)
.map((item) => item.embedding);
}
export async function embedQuery(query: string): Promise<number[]> {
const [embedding] = await embedTexts([query]);
return embedding;
}
We separate embedTexts (batch) from embedQuery (single) for clarity. The batch function accepts an array of texts and returns an array of vectors in the same order — this is important because OpenAI processes them more efficiently in a single API call than in multiple individual calls.
In production, you'd want to handle larger document sets by batching. OpenAI's API accepts up to 2048 texts per call, so for a 10,000-chunk corpus you'd split into 5 batches:
async function embedBatch(texts: string[], batchSize = 2048): Promise<number[][]> {
const allEmbeddings: number[][] = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const embeddings = await embedTexts(batch);
allEmbeddings.push(...embeddings);
}
return allEmbeddings;
}
You'd also want retry logic for rate limits — OpenAI returns 429 errors when you exceed your tokens-per-minute quota. A simple exponential backoff handles this gracefully.
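A minimal sketch of such a backoff wrapper — `withRetry` is a hypothetical helper, not part of the OpenAI SDK, and a production version would inspect the error to retry only on 429s:

```typescript
// Hypothetical retry helper: wait 1s, 2s, 4s, ... between failed attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts - 1) throw err; // out of attempts: give up
      const delay = baseDelayMs * 2 ** attempt;  // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage (hypothetical): const vectors = await withRetry(() => embedTexts(batch));
```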
The sort-by-index step in embedTexts matters because OpenAI's API doesn't guarantee response order matches input order. Without it, your embeddings could get shuffled relative to your chunks — and you'd silently store the wrong vector for each chunk. This is the kind of bug that's extremely hard to debug because everything appears to work, just with slightly worse retrieval quality.
Step 3: Vector store (in-memory)
A vector store is just a collection of vectors with a way to find the nearest neighbors to a query vector. We'll start with the simplest possible implementation: an in-memory store using cosine similarity.
Cosine similarity measures the angle between two vectors. A value of 1.0 means identical direction (identical meaning); 0.0 means the vectors are orthogonal (unrelated). It ignores vector magnitude, so it works regardless of whether your embeddings are normalized. This is important because different embedding models produce vectors with different magnitudes — cosine similarity gives you a consistent comparison metric.
Why cosine similarity over other distance metrics? There are three common options:
- Cosine similarity — Measures the angle between vectors. Range: -1 to 1, where 1 = identical direction. Ignores magnitude.
- Euclidean distance (L2) — Measures the straight-line distance between two points. Smaller = more similar. Sensitive to magnitude.
- Dot product — Measures both direction and magnitude. Faster to compute but results depend on vector norms.
For normalized embeddings (which OpenAI's are), all three give the same ranking. But cosine similarity is the standard for text embeddings because it's invariant to vector length, making it more robust across different embedding models and document lengths.
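You can verify the ranking equivalence yourself with toy 2-D unit vectors (illustrative only — real embeddings have hundreds of dimensions):

```typescript
// For unit-length vectors, cosine, dot product, and Euclidean distance
// all rank neighbors identically. Toy 2-D vectors for illustration.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a: number[], b: number[]): number =>
  dot(a, b) / (Math.hypot(...a) * Math.hypot(...b));
const euclidean = (a: number[], b: number[]): number =>
  Math.hypot(...a.map((v, i) => v - b[i]));

const query = [1, 0];                      // unit length
const docA = [Math.SQRT1_2, Math.SQRT1_2]; // 45 degrees from query
const docB = [0, 1];                       // 90 degrees from query

// docA ranks above docB under all three metrics:
// higher cosine, higher dot product, smaller Euclidean distance
```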
import { Chunk } from "./chunker.js";
export interface StoredDocument {
chunk: Chunk;
embedding: number[];
source: string;
}
export interface SearchResult {
chunk: Chunk;
score: number;
source: string;
}
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
export class VectorStore {
private documents: StoredDocument[] = [];
add(chunks: Chunk[], embeddings: number[][], source: string): void {
for (let i = 0; i < chunks.length; i++) {
this.documents.push({
chunk: chunks[i],
embedding: embeddings[i],
source,
});
}
}
search(queryEmbedding: number[], topK: number = 3): SearchResult[] {
const scored = this.documents.map((doc) => ({
chunk: doc.chunk,
source: doc.source,
score: cosineSimilarity(queryEmbedding, doc.embedding),
}));
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, topK);
}
get size(): number {
return this.documents.length;
}
}
This brute-force approach checks every document on every query. It's O(n) per search, which is fine for hundreds or even thousands of documents. Once you hit tens of thousands, you'll want an approximate nearest neighbor (ANN) index — which is exactly what production vector databases provide.
ANN algorithms like HNSW (Hierarchical Navigable Small World) build a graph structure over your vectors during indexing. At query time, they navigate this graph to find approximate nearest neighbors in O(log n) instead of comparing against every vector. The tradeoff is a small accuracy loss (typically 95-99% recall) for dramatically faster search — milliseconds instead of seconds at scale.
For our tutorial's three documents with six chunks, brute-force is instantaneous. But if you're indexing 100,000 support articles, each search would compare against every one of those vectors. At that scale, a dedicated vector database with HNSW indexing returns results in under 50ms where brute-force would take seconds.
Step 4: Generation
Now the fun part. We take the retrieved chunks, build a prompt, and ask the LLM to answer using only the provided context. This is where the "augmented" in Retrieval-Augmented Generation happens — the model's generation is augmented with retrieved information.
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
export interface GenerationResult {
answer: string;
sources: SearchResult[];
prompt: string;
}
export async function generate(
query: string,
results: SearchResult[],
options: { model?: string; temperature?: number } = {}
): Promise<GenerationResult> {
const { model = "gpt-4o-mini", temperature = 0.2 } = options;
const contextBlock = results
.map(
(r, i) =>
`[Source ${i + 1}] (score: ${r.score.toFixed(3)}, from: ${r.source})\n${r.chunk.text}`
)
.join("\n\n");
const systemPrompt = `You are a helpful assistant that answers questions based on the provided context documents.
Rules:
- Answer ONLY based on the provided context
- If the context doesn't contain enough information, say so
- Cite which source(s) you used with [Source N] notation
- Be concise and direct`;
const userPrompt = `Context documents:
${contextBlock}
Question: ${query}
Answer based on the context above:`;
const response = await openai.chat.completions.create({
model,
temperature,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: userPrompt },
],
});
return {
answer: response.choices[0].message.content ?? "",
sources: results,
prompt: userPrompt,
};
}
A few design choices worth noting here:
The system prompt explicitly tells the model to only use the provided context. This is crucial — without it, the model will happily fill gaps with its training data, which defeats the purpose of RAG. This is a core prompt engineering principle: explicit constraints produce more reliable behavior.
We include the similarity score in each source block. This is useful for debugging — if your top chunk has a score of 0.65, that's a signal your retrieval might not be finding great matches, even before you look at the generated answer.
Temperature is set to 0.2, which keeps the model's output focused and consistent. Higher temperatures (0.7+) produce more creative responses, but for factual RAG answers you want repeatability. If you ask the same question twice, you should get substantially the same answer.
We return the full prompt alongside the answer. This makes debugging much easier — you can see exactly what the model received and why it responded the way it did. In production, logging the prompt, the retrieved chunks, the similarity scores, and the generated answer gives you a complete audit trail for every response.
The prompt structure itself matters more than you might think. We put the context before the question, which works well for most models. Some teams find that putting the question first ("Given this question: X, use the following context to answer:") works better for their use case. The [Source N] citation format makes it easy for users to verify answers against the original documents — transparency that builds trust in RAG-powered systems.
The top-K parameter (how many chunks to retrieve) creates a direct tradeoff between context coverage and prompt cost. More chunks means the model has more information to work with, but also more tokens to pay for and more potential for the model to get confused by irrelevant context. A good starting point:
- K=3 for focused, single-topic questions ("What's the refund policy?")
- K=5 for broader questions that might span multiple documents ("Give me an overview of the pricing tiers and what each includes")
- K=1 for simple lookup questions where you just need the closest match ("What's the phone number for support?")
Step 5: Putting it all together
Now we wire up all four components — chunker, embedder, vector store, and generator — into a complete pipeline. The indexing phase runs once (or whenever documents change), while the query phase runs for every user question.
The sample documents below simulate a real knowledge base with three types of content: product overview (features), pricing FAQ (numbers and plans), and technical documentation (the memory system). This variety lets us test whether retrieval correctly routes different types of questions to the right source documents.
import { chunkText } from "./chunker.js";
import { embedTexts, embedQuery } from "./embeddings.js";
import { VectorStore } from "./vector-store.js";
import { generate } from "./generator.js";
// Sample documents — imagine these come from your knowledge base
const documents = [
{
source: "product-overview.md",
content: `Chanl is an AI agent platform for building, connecting, and monitoring
customer experience agents. It supports voice and text channels. Agents can be
configured with custom prompts, knowledge bases, and tool integrations.
The platform provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement.
Agents connect to external systems through MCP (Model Context Protocol)
integrations. MCP allows agents to call APIs, query databases, and trigger
workflows in third-party tools without custom code.`,
},
{
source: "pricing-faq.md",
content: `Chanl offers three pricing tiers: Lite, Startup, and Business.
The Lite plan includes up to 5 agents and 1,000 interactions per month.
It costs $49/month and is designed for small teams getting started.
The Startup plan includes up to 25 agents and 10,000 interactions per month.
It costs $199/month and includes advanced analytics and priority support.
The Business plan includes unlimited agents and interactions.
Pricing is custom and includes dedicated support, SLAs, and SSO.`,
},
{
source: "memory-system.md",
content: `The memory system allows agents to remember information across conversations.
Short-term memory persists within a single conversation session.
Long-term memory stores facts about customers across multiple conversations.
Memory entries are automatically extracted from conversations and stored
as key-value pairs. For example, if a customer mentions they prefer email
communication, the agent stores this preference and uses it in future
interactions.
Memory can be managed through the API or the admin dashboard. Entries can
be viewed, edited, or deleted. Memory is scoped per customer per agent.`,
},
];
async function main() {
console.log("=== RAG Pipeline Demo ===\n");
// Step 1: Index documents
console.log("Indexing documents...");
const store = new VectorStore();
for (const doc of documents) {
const chunks = chunkText(doc.content, { maxChunkSize: 300, overlap: 30 });
const embeddings = await embedTexts(chunks.map((c) => c.text));
store.add(chunks, embeddings, doc.source);
console.log(` ${doc.source}: ${chunks.length} chunks`);
}
console.log(`\nTotal chunks in store: ${store.size}\n`);
// Step 2: Query
const queries = [
"What analytics features does Chanl provide?",
"How much does the Startup plan cost?",
"How does the memory system work?",
"Does Chanl support Salesforce integration?",
];
for (const query of queries) {
console.log(`Q: ${query}`);
// Retrieve
const queryEmbedding = await embedQuery(query);
const results = store.search(queryEmbedding, 3);
console.log(` Retrieved ${results.length} chunks:`);
for (const r of results) {
console.log(
` - [${r.source}] score: ${r.score.toFixed(3)} | "${r.chunk.text.slice(0, 60)}..."`
);
}
// Generate
const { answer } = await generate(query, results);
console.log(`\nA: ${answer}\n`);
console.log("---\n");
}
}
main().catch(console.error);
Run it:
export OPENAI_API_KEY="sk-your-key-here"
npx tsx src/rag.ts
You should see the pipeline index your documents, retrieve relevant chunks for each query, and generate grounded answers. Here's what the output looks like:
=== RAG Pipeline Demo ===
Indexing documents...
product-overview.md: 2 chunks
pricing-faq.md: 2 chunks
memory-system.md: 2 chunks
Total chunks in store: 6
Q: What analytics features does Chanl provide?
Retrieved 3 chunks:
- [product-overview.md] score: 0.847 | "The platform provides real-time analytics for monitoring..."
- [product-overview.md] score: 0.762 | "Chanl is an AI agent platform for building, connecting..."
- [pricing-faq.md] score: 0.643 | "The Startup plan includes up to 25 agents and 10,000..."
A: Chanl provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement
[Source 1].
Pay attention to the similarity scores — they tell you how confident the retrieval is for each chunk. In this example, the top chunk scores 0.847 (strong match), the second is 0.762 (good supporting context), and the third at 0.643 is a weaker match that was pulled in because it mentions analytics tangentially.
Notice the four queries test different aspects of the pipeline. The first three have clear answers in the documents. The last query — about Salesforce — is intentionally unanswerable from the provided context. A well-configured RAG pipeline should say it doesn't have enough information rather than hallucinate. If your pipeline makes up a Salesforce answer, your system prompt needs tightening.
This is a good sanity check for any RAG system: always include at least one question that can't be answered from the context. If the model answers it anyway, you've got a faithfulness problem.
Choosing a chunking strategy
The chunking strategy you choose has a bigger impact on retrieval quality than most people expect. Recursive chunking is your best default — it tries paragraphs first, falls back to sentences, then words, preserving semantic coherence while respecting size limits.
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size (every N chars) | Simple, predictable | Cuts mid-sentence, breaks meaning | Structured data, logs |
| Sentence-based (split on .) | Preserves sentence meaning | Uneven sizes, some chunks too small | Clean prose, FAQs |
| Recursive (paragraph → sentence → word) | Best semantic coherence | More complex to implement | General-purpose (recommended) |
| Semantic (split when meaning shifts) | Most precise boundaries | Requires embedding each sentence first | High-quality knowledge bases |
The recursive splitter we built follows exactly this pattern, and its overlap parameter ensures that context at chunk boundaries isn't completely lost.
Chunk size directly affects retrieval precision. Smaller chunks (200–300 tokens) are more precise but miss surrounding context. Larger chunks (500–1000 tokens) capture more context but dilute the signal — the embedding becomes an average of too many ideas. 300–500 tokens hits the sweet spot for most pipelines.
There's a fourth strategy worth mentioning: semantic chunking. Instead of splitting on character boundaries, you embed every sentence, then split where the cosine similarity between consecutive sentences drops below a threshold. This produces chunks that follow the natural topic boundaries of the text. The downside is that you have to embed every sentence during indexing (more API calls), so it's more expensive. But for high-value knowledge bases where retrieval quality is critical, it can meaningfully improve results.
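A sketch of the semantic-chunking idea, with the embedding function left pluggable so you could wire in embedTexts. The `semanticChunk` name, threshold, and toy embedder are illustrative assumptions, not a production implementation:

```typescript
// Semantic chunking sketch: embed each sentence, start a new chunk whenever
// similarity between consecutive sentences drops below a threshold.
function semanticChunk(
  sentences: string[],
  embed: (s: string) => number[], // pluggable; in practice an API call
  threshold = 0.5
): string[][] {
  const cosine = (a: number[], b: number[]): number => {
    const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
    return dot / (Math.hypot(...a) * Math.hypot(...b));
  };
  const chunks: string[][] = [];
  let current: string[] = [];
  let prev: number[] | null = null;
  for (const sentence of sentences) {
    const vec = embed(sentence);
    if (prev && cosine(prev, vec) < threshold) {
      chunks.push(current); // topic shift: close the current chunk
      current = [];
    }
    current.push(sentence);
    prev = vec;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

With a real embedder, sentences about pricing would cluster into one chunk and sentences about, say, the memory system into another, because the cross-topic similarity dips below the threshold.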
To illustrate why strategy matters, consider this document:
Our enterprise plan includes dedicated support with a 4-hour SLA. All enterprise customers get SSO integration. Pricing is based on usage volume and starts at $500/month.
With fixed-size chunking at 80 characters, this might split into:
- Chunk 1: "Our enterprise plan includes dedicated support with a 4-hour SLA. All ente"
- Chunk 2: "rprise customers get SSO integration. Pricing is based on usage volume and"
- Chunk 3: " starts at $500/month."
A query about "enterprise pricing" would match Chunk 3 (which mentions $500 but lacks context) and possibly Chunk 2 (which mentions pricing but cuts off). The recursive splitter keeps this as a single chunk because it's under the size limit, so all three facts stay together.
A practical way to validate your chunking: run 20 representative queries and manually check whether the right chunk lands in the top 3 results. If relevant information keeps getting split across chunks or buried in oversized ones, adjust your size and overlap parameters.
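That manual check can be sketched as a tiny recall@K harness — `recallAtK`, `EvalCase`, and the ID-returning `search` signature are hypothetical stand-ins for however your pipeline exposes retrieval:

```typescript
// Hypothetical eval sketch: fraction of test queries whose expected chunk
// appears in the top K retrieval results.
interface EvalCase {
  query: string;
  expectedChunkId: string;
}

function recallAtK(
  cases: EvalCase[],
  search: (query: string, k: number) => string[], // returns chunk IDs
  k = 3
): number {
  let hits = 0;
  for (const c of cases) {
    if (search(c.query, k).includes(c.expectedChunkId)) hits++;
  }
  return hits / cases.length;
}
```

A score below ~0.8 on representative queries is a strong hint to revisit chunk size, overlap, or the embedding model before touching the generation prompt.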
Choosing an embedding model
We used OpenAI's text-embedding-3-small because it's the easiest to get started with. Here's how it compares to the alternatives:
| Model | Dimensions | Cost | Quality | Speed |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Good | Fast |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Better | Fast |
| Voyage AI voyage-3 | 1024 | $0.06/1M tokens | Excellent for code | Fast |
| Nomic Embed (local) | 768 | Free (self-hosted) | Good | Depends on hardware |
| BGE-M3 (local) | 1024 | Free (self-hosted) | Good multilingual | Depends on hardware |
For most teams, text-embedding-3-small is the right default — fast, cheap, and good enough. If you're already using OpenAI, it keeps things simple. If you need the best possible retrieval quality and don't mind the cost, text-embedding-3-large is a meaningful step up — the extra 1536 dimensions capture finer-grained semantic distinctions. If you can't send data to external APIs, Nomic or BGE-M3 run locally via Ollama.
Voyage AI deserves special mention if your documents contain code. Voyage's embedding models are trained with an emphasis on code and technical text (including code-specific variants), so they tend to outperform general-purpose models for code search and technical documentation retrieval.
One critical rule: you must use the same embedding model for indexing and querying. Vectors from different models live in different vector spaces and can't be compared meaningfully. If you switch embedding models, you need to re-embed your entire document set. This is also why it's worth choosing carefully upfront — re-embedding a million documents isn't free.
Cost estimation
Here's a quick way to estimate your embedding costs. A typical English word is about 1.3 tokens, so a 500-word chunk is ~650 tokens; the table below rounds to 500-token chunks to keep the arithmetic simple.
| Corpus size | Chunks (at 500 tokens each) | Embedding cost (3-small) | Embedding cost (3-large) |
|---|---|---|---|
| 100 pages | ~200 chunks | $0.002 | $0.013 |
| 1,000 pages | ~2,000 chunks | $0.02 | $0.13 |
| 10,000 pages | ~20,000 chunks | $0.20 | $1.30 |
| 100,000 pages | ~200,000 chunks | $2.00 | $13.00 |
You only pay for embedding once per document. Re-embedding happens only when documents change. Query embedding costs are negligible — one query is a single API call of ~20 tokens.
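The arithmetic can be wrapped in a small helper for plugging in your own corpus numbers. This is a sketch using the list prices from the comparison table above; actual token counts depend on your chunking.

```typescript
// Rough one-time embedding cost: chunks × tokens-per-chunk × price.
// Prices are $ per 1M tokens, from the model comparison table above.
const PRICE_PER_M_TOKENS: Record<string, number> = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
};

function estimateEmbeddingCost(
  chunkCount: number,
  model: string,
  tokensPerChunk = 500
): number {
  const totalTokens = chunkCount * tokensPerChunk;
  return (totalTokens / 1_000_000) * PRICE_PER_M_TOKENS[model];
}

// ~2,000 chunks (roughly 1,000 pages):
console.log(estimateEmbeddingCost(2_000, "text-embedding-3-small")); // → 0.02
```

Re-running this with your real chunk count after indexing tells you what a full re-embed would cost if you ever switch models.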
Dimension reduction
OpenAI's text-embedding-3-* models support a dimensions parameter that lets you reduce the output size. You can request 256 or 512 dimensions instead of the full 1536. Smaller vectors mean faster search and less storage, at the cost of some retrieval quality. For prototyping or very large corpora where storage cost matters, this is a useful lever.
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
dimensions: 512, // Reduced from 1536
});
At 512 dimensions, your vectors use 3x less memory and searches run faster, while retrieval quality drops only 1-3% for most use cases. This is a good option when you're indexing millions of documents and storage cost is a real concern. You can test by running your eval suite at 1536 vs 512 dimensions and measuring the actual quality difference for your specific data.
Choosing a vector database
The in-memory vector store we built works for demos and small datasets. For production, you'll want a dedicated vector database that handles persistence, scaling, metadata filtering, and efficient approximate nearest-neighbor search. Here's the landscape:
Pinecone — Fully managed, serverless, sub-50ms latency even at billion-scale. Best for teams that don't want to manage infrastructure. Free tier available.
Chroma — Open source, Python-native, minimal setup. Great for prototyping and small to medium datasets. Can run embedded in your process or as a separate server.
pgvector — PostgreSQL extension. If you already run Postgres, this is the lowest-friction option, with competitive performance up to ~100M vectors when paired with the pgvectorscale extension. The biggest advantage: your chunks, embeddings, and document metadata all live in the same Postgres instance, so you can JOIN vector results against relational data with standard SQL instead of maintaining a second database.
Example pgvector query for context:
SELECT chunk_text, source, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE category = 'pricing'
ORDER BY embedding <=> $1
LIMIT 3;
The <=> operator computes cosine distance. The metadata filter (WHERE category = 'pricing') scopes the vector search to matching rows, which is exactly the kind of scoping that improves retrieval precision.
Weaviate — Open source with strong hybrid search (combining vector + keyword). Available as managed cloud or self-hosted.
Qdrant — Open source, Rust-based, excellent performance. Best free tier among dedicated vector databases.
Here's a quick decision matrix:
| If you... | Use |
|---|---|
| Already run PostgreSQL | pgvector — no new infrastructure |
| Want zero ops | Pinecone — fully managed, serverless |
| Need open-source + self-hosted | Qdrant — best performance, Rust-based |
| Are prototyping in Python | Chroma — fastest to get started |
| Need hybrid search built in | Weaviate — vector + keyword out of the box |
Swapping from our in-memory store to any of these is straightforward — you're replacing the add() and search() methods. The chunking, embedding, and generation layers stay exactly the same. That modularity is why building from scratch first is valuable: you understand which piece does what, so upgrading one component doesn't require rethinking the whole system.
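Concretely, the swap surface looks something like this. The interface below is a sketch (the names are illustrative, not the exact store from earlier in the tutorial), with a brute-force in-memory implementation as one possible backend:

```typescript
interface Chunk {
  id: string;
  text: string;
}

interface SearchResult {
  chunk: Chunk;
  score: number;
}

// Any backend (in-memory, pgvector, Pinecone, Qdrant) implements this.
interface VectorStore {
  add(chunks: Chunk[], embeddings: number[][]): void;
  search(queryEmbedding: number[], topK: number): SearchResult[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force scan: fine for demos, replaced by an ANN index
// in any of the databases above.
class InMemoryStore implements VectorStore {
  private items: { chunk: Chunk; embedding: number[] }[] = [];

  add(chunks: Chunk[], embeddings: number[][]): void {
    chunks.forEach((chunk, i) => {
      this.items.push({ chunk, embedding: embeddings[i] });
    });
  }

  search(queryEmbedding: number[], topK: number): SearchResult[] {
    return this.items
      .map(({ chunk, embedding }) => ({
        chunk,
        score: cosineSimilarity(queryEmbedding, embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

A pgvector or Pinecone adapter implements the same two methods; nothing upstream (chunking, embedding) or downstream (generation) changes.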
When your agents start connecting to external systems via MCP integrations and tool calls, the RAG pipeline becomes just one of several information sources. The vector store might handle product docs while an MCP tool queries live inventory data. Understanding each piece independently makes that composition straightforward.
Evaluating your pipeline
A RAG pipeline that returns wrong answers confidently is worse than no RAG at all. You need to measure three things.
Retrieval quality
Did the retriever find the right chunks? The simplest check: look at the similarity scores and the retrieved text. If the top chunk isn't relevant to the question, your retrieval is broken — no amount of generation quality fixes that.
With cosine similarity, a score above 0.8 usually means strong relevance, 0.6–0.8 is acceptable but worth monitoring, and below 0.6 the retriever is probably pulling in noise (absolute thresholds vary by embedding model, so calibrate against your own data). Log these scores for every query in production so you can spot degradation over time.
You should also check for false negatives — queries where the correct chunk exists in your store but doesn't appear in the top-K results. This usually means the chunk's embedding doesn't capture the right semantics, or your chunk is too large and the relevant passage is buried in unrelated text.
A simple retrieval evaluation metric is Recall@K: for a set of test queries where you know which chunk should be retrieved, what percentage of the time does the correct chunk appear in the top K results? Aim for Recall@3 above 85%. If you're below that, focus on chunking and embedding quality before touching anything else.
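Recall@K takes only a few lines once you have a labeled test set. A sketch, assuming each test case pairs a query's expected chunk id with the ranked ids your retriever actually returned:

```typescript
interface RecallCase {
  expectedChunkId: string;     // the chunk that should be retrieved
  retrievedChunkIds: string[]; // ids your retriever returned, best first
}

// Fraction of test queries whose expected chunk appears in the top K.
function recallAtK(cases: RecallCase[], k: number): number {
  const hits = cases.filter((c) =>
    c.retrievedChunkIds.slice(0, k).includes(c.expectedChunkId)
  ).length;
  return hits / cases.length;
}
```

Run it after every chunking or embedding change; a drop below your baseline tells you where to look before touching the prompt.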
Faithfulness
Does the generated answer actually use the retrieved context, or is the model ignoring it and hallucinating? This is the most critical evaluation dimension. A model that makes up plausible-sounding information is actively harmful — users trust it because it sounds confident.
Test this explicitly: retrieve chunks about Topic A, but ask about Topic B. If the model answers about Topic B (using training data instead of admitting the context doesn't cover it), your faithfulness constraint is too weak.
Answer quality
Is the answer correct, complete, and helpful? This is the end-to-end metric. Even with good retrieval and faithful generation, the answer might be poorly structured, miss important nuances, or be unnecessarily verbose.
Here's an evaluation function using LLM-as-judge:
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
interface EvalResult {
relevanceScore: number;
faithfulnessScore: number;
qualityScore: number;
reasoning: string;
}
export async function evaluateResponse(
query: string,
answer: string,
retrievedChunks: SearchResult[],
referenceAnswer?: string
): Promise<EvalResult> {
const context = retrievedChunks.map((r) => r.chunk.text).join("\n\n");
const evalPrompt = `You are an evaluation judge for a RAG system. Score the following on a scale of 1-5.
Query: ${query}
Retrieved Context:
${context}
Generated Answer:
${answer}
${referenceAnswer ? `\nReference Answer: ${referenceAnswer}` : ""}
Score these three dimensions (1-5 each):
1. RELEVANCE: Are the retrieved chunks relevant to the query?
2. FAITHFULNESS: Does the answer only use information from the retrieved context? (5 = fully grounded, 1 = hallucinated)
3. QUALITY: Is the answer correct, complete, and helpful?
Respond in JSON format:
{"relevance": N, "faithfulness": N, "quality": N, "reasoning": "brief explanation"}`;
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0,
messages: [{ role: "user", content: evalPrompt }],
response_format: { type: "json_object" },
});
const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
return {
relevanceScore: parsed.relevance ?? 0,
faithfulnessScore: parsed.faithfulness ?? 0,
qualityScore: parsed.quality ?? 0,
reasoning: parsed.reasoning ?? "",
};
}
// Usage: add this to your main() function
// const evalResult = await evaluateResponse(query, answer, results);
// console.log(` Eval: R=${evalResult.relevanceScore} F=${evalResult.faithfulnessScore} Q=${evalResult.qualityScore}`);
// console.log(`  Reasoning: ${evalResult.reasoning}`);
This uses LLM-as-judge evaluation — the same approach used by RAG evaluation frameworks like RAGAS and DeepEval. The judge can be wrong (it's an LLM too), but it's the fastest way to get automated quality signals at scale.
A few notes on making evaluation useful in practice:
- Build a test set. Curate 50–100 question-answer pairs that cover your key use cases. Run them through the pipeline after every change to catch regressions.
- Track scores over time. A sudden drop in average faithfulness score after you change your chunking strategy is a clear signal something went wrong.
- Use reference answers. When you have gold-standard answers, pass them as referenceAnswer to give the judge something to compare against.
- Don't trust the judge blindly. Spot-check its evaluations manually. If the judge consistently rates faithfulness 5/5 when you can see the model is hallucinating, your eval prompt needs work.
For a production evaluation harness with regression testing and CI integration, see our guide on building an eval framework from scratch. In production, combine automated LLM-as-judge scoring with human evaluation on a sample of queries and analytics dashboards tracking quality metrics over time. The automated scores catch regressions fast; the human reviews catch the subtle failures that automated scoring misses.
Common failure modes
Once you have a working pipeline, here's what typically breaks — and how to fix it.
Retriever finds irrelevant chunks. Your chunks are too large, or your embeddings don't capture the right semantics. Fix: smaller chunks, try a better embedding model, or add metadata filtering to scope searches by document category. If a user asks about billing but your retriever pulls in onboarding docs, metadata filters that restrict search to "billing" documents would solve this immediately.
Model ignores the context. This usually means your prompt isn't constraining the model enough, or the context is so long that the model "loses" the relevant information in the middle. Research on "lost in the middle" has shown that LLMs attend more to the beginning and end of their context window. Fix: tighten your system prompt, reduce the number of retrieved chunks, or put the most relevant chunk last.
Answers are correct but miss important details. Your top-K is too low, or the relevant information is spread across chunks that don't get retrieved together. Fix: increase top-K from 3 to 5, add a reranker to promote better chunks from a larger initial retrieval set (retrieve 20, rerank, keep 3), or try larger chunks with more overlap so related information stays together.
Performance is slow. Embedding the query + searching + generating takes too long for interactive use. Fix: cache frequently-asked-question embeddings so you skip the embedding step for repeated queries, use a vector database with ANN indexing instead of brute-force search, and consider a smaller or faster generation model for lower-stakes queries.
Answers hallucinate beyond the context. The model fills in gaps with training data even when told not to. Fix: lower the temperature to 0.1, make the system prompt more explicit about refusing to answer when context is insufficient, and add a faithfulness evaluation step that flags or blocks low-scoring responses before they reach users.
Stale documents produce wrong answers. Your knowledge base has been updated but the embeddings haven't been regenerated. Fix: build a re-indexing pipeline that watches for document changes and re-embeds affected chunks. Track the last-indexed timestamp per document so you can verify freshness. This is especially dangerous for time-sensitive content like pricing, policies, or compliance documents where an outdated answer could have real consequences.
Duplicate or near-duplicate chunks dominate results. If the same information appears in multiple documents (e.g., the refund policy is mentioned in both the FAQ and the terms of service), your top-K results might all contain the same information, pushing out other relevant context. Fix: deduplicate at indexing time by checking cosine similarity between new chunks and existing ones, or add diversity to your retrieval by penalizing chunks that are too similar to already-selected results (maximal marginal relevance).
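Maximal marginal relevance is a greedy loop on top of cosine similarity. A sketch, assuming each candidate arrives with its embedding and its similarity score against the query; lambda trades relevance against diversity:

```typescript
interface Candidate {
  id: string;
  embedding: number[];
  queryScore: number; // similarity to the query, from the initial search
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedily pick chunks that are relevant to the query but dissimilar to
// chunks already picked. lambda = 1 is pure relevance, 0 is pure diversity.
function mmr(candidates: Candidate[], k: number, lambda = 0.7): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    pool.forEach((c, i) => {
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => cosine(c.embedding, s.embedding)));
      const score = lambda * c.queryScore - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

A lambda around 0.5–0.7 is a common starting point; tune it against your eval suite rather than guessing.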
Production considerations
Moving from this tutorial to a production RAG system involves a few additional concerns that are worth thinking about early, even if you don't implement them right away.
Hybrid search. Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid search combines both — run a BM25 keyword search and a vector search in parallel, then merge the results. Most production RAG systems use this approach. Weaviate and pgvector both support it natively.
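One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only ranks, so there's no need to normalize BM25 scores against cosine similarities. A sketch:

```typescript
// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank).
// k = 60 is the conventional constant from the original RRF paper; it
// keeps a single top rank from dominating the fused score.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Chunk ids ranked by BM25 and by vector similarity:
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // keyword results
  ["b", "c", "d"], // vector results
]);
// "b" ranks first because it appears high in both lists.
```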
Reranking. Vector search is fast but approximate. A cross-encoder reranker takes the top 20 results from vector search and re-scores them using a more accurate (but slower) model. The reranker sees both the query and the document together (not just their independent embeddings), so it can catch nuances that embedding comparison misses — like negation, qualifier words, or subtle distinctions. Cohere's Rerank API and open-source models like bge-reranker-v2-m3 are popular choices. The typical latency overhead is 100-300ms, which is worth it when precision matters.
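The retrieve-then-rerank pattern is a thin wrapper regardless of which reranker you plug in. In this sketch the cross-encoder is abstracted as a scoring callback; in practice that callback would call Cohere's Rerank API or a local bge-reranker model (the stub in the test is purely illustrative):

```typescript
interface RerankedResult {
  text: string;
  score: number;
}

// Retrieve wide with vector search, rescore each (query, document) pair
// with a more accurate cross-encoder, keep a narrow top-N.
async function retrieveAndRerank(
  query: string,
  candidates: string[], // e.g. top-20 chunks from the vector store
  crossEncoderScore: (query: string, doc: string) => Promise<number>,
  topN = 3
): Promise<RerankedResult[]> {
  const scored = await Promise.all(
    candidates.map(async (text) => ({
      text,
      score: await crossEncoderScore(query, text),
    }))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```

Because the scorer is injected, you can swap rerankers, or stub one out in tests, without touching the retrieval pipeline.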
Query routing. Not every question needs RAG. "What's 2+2?" doesn't require document retrieval. A lightweight classifier can route simple questions directly to the LLM and only invoke the RAG pipeline for questions that need domain knowledge. This reduces latency and cost.
Multi-turn context. In a conversation, the user's second question often references the first. "Tell me about pricing" followed by "What about the enterprise tier?" — the second query alone doesn't mention pricing. If you embed it as-is, the retriever won't know to look for pricing documents. Fix: use the LLM to rewrite the query into a standalone form ("What is the pricing for the enterprise tier?") before embedding it. This is called "query rewriting" or "question condensation" and adds one LLM call per turn but dramatically improves multi-turn retrieval accuracy.
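A minimal sketch of the rewriting step: build a prompt from the conversation so far, send it to a cheap model, and embed the rewritten question instead of the raw one. The prompt wording here is an assumption to adapt, not a canonical template:

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Ask the LLM to fold conversation context into a standalone question
// before embedding it for retrieval.
function buildRewritePrompt(history: Turn[], latestMessage: string): string {
  const transcript = history
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  return [
    "Rewrite the final user message as a standalone question that",
    "contains all the context needed to search a document store.",
    "Return only the rewritten question.",
    "",
    "Conversation:",
    transcript,
    `user: ${latestMessage}`,
  ].join("\n");
}
```

Send this prompt to a cheap model (gpt-4o-mini at temperature 0 is a reasonable choice) and feed its output to your embedding call in place of the raw follow-up.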
Access control. If different users should see different documents, you need to filter results by permission at query time. Tag each chunk with the access groups that should see it, then add a filter to your vector search. This is straightforward with metadata filtering in Pinecone or pgvector, but easy to forget until someone retrieves a document they shouldn't have access to.
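The post-filter version is a one-liner, shown here as a sketch; in production you'd push the same condition into Pinecone or pgvector as a metadata filter so unauthorized chunks never leave the database:

```typescript
interface SecuredResult {
  text: string;
  score: number;
  accessGroups: string[]; // groups allowed to see this chunk, set at indexing time
}

// Keep only results the requesting user's groups are allowed to see.
function filterByAccess(
  results: SecuredResult[],
  userGroups: string[]
): SecuredResult[] {
  return results.filter((r) =>
    r.accessGroups.some((group) => userGroups.includes(group))
  );
}
```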
Document versioning. Documents change over time. Your 2024 return policy might differ from your 2026 one. If you just overwrite chunks, old queries might get stale information. A common pattern: store a version or timestamp with each chunk, and during re-indexing, insert new chunks before deleting old ones so there's no gap in coverage.
Error handling and fallbacks. In production, the embedding API might be slow or unavailable. Your vector store might time out. The generation model might return an error. Build graceful degradation: if embeddings fail, fall back to keyword search. If retrieval returns low-confidence results (all scores below 0.5), skip generation and return "I couldn't find relevant information." If generation fails, return the raw chunks with a note that the answer couldn't be synthesized.
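The degradation chain can be written as one wrapper with injected dependencies, which also makes it easy to unit-test with stubs. A sketch (the function names and the 0.5 threshold are illustrative):

```typescript
interface RetrievedChunk {
  text: string;
  score: number;
}

// Degrade gracefully: keyword search if embeddings fail, an honest
// refusal if confidence is low, raw chunks if generation fails.
async function answerWithFallbacks(
  query: string,
  vectorSearch: (q: string) => Promise<RetrievedChunk[]>,
  keywordSearch: (q: string) => Promise<RetrievedChunk[]>,
  generate: (q: string, chunks: RetrievedChunk[]) => Promise<string>,
  minScore = 0.5
): Promise<string> {
  let chunks: RetrievedChunk[];
  try {
    chunks = await vectorSearch(query);
  } catch {
    chunks = await keywordSearch(query); // embedding API down
  }
  if (chunks.every((c) => c.score < minScore)) {
    return "I couldn't find relevant information.";
  }
  try {
    return await generate(query, chunks);
  } catch {
    // Better than nothing: show the sources without a synthesized answer.
    return (
      "Couldn't synthesize an answer. Relevant excerpts:\n" +
      chunks.map((c) => c.text).join("\n")
    );
  }
}
```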
Monitoring and observability. Log every query along with: the retrieved chunks, their similarity scores, the generated answer, and the latency of each step (embedding, retrieval, generation). This gives you a complete picture of pipeline health. Alert on: average retrieval score dropping below a threshold, generation latency spiking, or the proportion of "I don't know" answers increasing. Over time, these logs also become your eval dataset — real user queries with real retrieved results that you can use to improve chunking, embeddings, and prompts.
Caching. Two layers of caching can dramatically reduce cost and latency. First, cache embedding results for repeated queries — if 100 users ask "What's your refund policy?", you only need to embed it once. Second, cache full RAG responses for identical queries. The cache key is the query text (or its hash), and you invalidate it when the underlying documents change.
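A sketch of both layers with a version counter for invalidation: bumping the version on re-index makes every old key unreachable, which is cheaper than enumerating and deleting stale entries. In production you'd back this with Redis or similar rather than in-process Maps.

```typescript
// Two cache layers keyed by normalized query text. Bump `version`
// whenever the underlying documents change to invalidate everything.
class RagCache {
  private embeddings = new Map<string, number[]>();
  private answers = new Map<string, string>();
  private version = 0;

  private key(query: string): string {
    return `${this.version}:${query.trim().toLowerCase()}`;
  }

  getEmbedding(query: string): number[] | undefined {
    return this.embeddings.get(this.key(query));
  }

  setEmbedding(query: string, embedding: number[]): void {
    this.embeddings.set(this.key(query), embedding);
  }

  getAnswer(query: string): string | undefined {
    return this.answers.get(this.key(query));
  }

  setAnswer(query: string, answer: string): void {
    this.answers.set(this.key(query), answer);
  }

  invalidate(): void {
    this.version++; // old keys become unreachable
  }
}
```

Normalizing the key means "What's your refund policy?" and "what's your refund policy?" share one cached embedding and one cached answer.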
Next steps
You now have a working RAG pipeline that you understand end to end. Every piece is visible, every decision is yours. From here, the natural progressions are:
- Swap the vector store for Chroma or pgvector and test with a larger document corpus — our sample documents are tiny, and real-world performance depends on scale
- Add a reranker — retrieve top-20 with vector search, rerank with a cross-encoder model, keep top-3. This dramatically improves precision when your initial retrieval is noisy
- Implement hybrid search — combine vector similarity with keyword matching (BM25). Vector search handles semantic similarity; BM25 handles exact keyword matches. Together they catch cases that either misses alone
- Add metadata filtering — tag chunks with source, date, category, and filter during retrieval. A user asking about "2026 pricing" shouldn't get chunks from your 2024 pricing page
- Try query expansion — rewrite the user's query into multiple forms before searching. "How do I get a refund?" could also search for "return policy" and "money back guarantee"
- Build a proper eval suite — create a test set of question-answer pairs and run them against your pipeline automatically. Track retrieval precision and answer quality over time so you catch regressions before users do
Each of these improvements is independent — you can add a reranker without changing your chunking, or switch vector databases without touching your generation code. That's the advantage of understanding each piece: you know exactly which component to upgrade when you hit a specific limitation.
RAG is the foundation that makes AI agents actually useful with your data. Whether you're building a customer support agent, an internal knowledge assistant, or any system that needs to answer questions from a specific corpus, the pattern is the same: chunk, embed, retrieve, generate. Everything else is optimization on top of these four operations, and now you understand exactly what each one does and why.