You've probably seen this happen: you ask an LLM a question about your company's product docs, and it confidently hallucinates an answer that sounds right but is completely wrong. The model doesn't know your docs exist. It's generating from its training data, not your data.
Retrieval-Augmented Generation (RAG) fixes this. Instead of hoping the model memorized the right information during training, you fetch the relevant documents first and hand them to the model as context. The model generates an answer grounded in your actual data. No fine-tuning, no retraining, no waiting weeks for a model update.
In this tutorial, you'll build a complete RAG pipeline from scratch — twice. Once in TypeScript, once in Python. No frameworks, no magic abstractions. Just embeddings, vector search, and LLM generation wired together so you understand every moving part.
What RAG Actually Does
RAG has three stages. That's it. Everything else is optimization on top of these three:
1. Indexing — Take your documents, split them into chunks, convert each chunk into a vector embedding, and store those vectors somewhere searchable.
2. Retrieval — When a user asks a question, convert that question into a vector embedding too, then find the document chunks whose vectors are closest to the question vector.
3. Generation — Take the retrieved chunks, stuff them into a prompt alongside the user's question, and send the whole thing to an LLM. The model generates an answer grounded in those specific documents.
That's the whole pattern. The reason RAG works so well is that it separates knowing where to look (retrieval) from knowing how to answer (generation). The retrieval system handles relevance. The LLM handles synthesis and language. Each does what it's best at.
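The separation is easier to see in miniature. Here's a toy version of the three stages with word overlap standing in for real embeddings and a string template standing in for the LLM — illustration only; the real pipeline below swaps in actual models but keeps exactly this shape:

```python
# Toy RAG pipeline: the same three stages, with word-overlap "embeddings"
# standing in for a real embedding model (assumption: demo only).

def index(documents: list[str]) -> list[set[str]]:
    # Stage 1: "embed" each document as its set of lowercase words
    return [set(doc.lower().split()) for doc in documents]

def retrieve(query: str, store: list[set[str]], documents: list[str], k: int = 1) -> list[str]:
    # Stage 2: rank documents by word overlap with the query
    q = set(query.lower().split())
    ranked = sorted(range(len(store)), key=lambda i: len(q & store[i]), reverse=True)
    return [documents[i] for i in ranked[:k]]

def generate(query: str, context: list[str]) -> str:
    # Stage 3: a real system sends this prompt to an LLM instead
    return f"Answer '{query}' using: {' | '.join(context)}"

docs = ["the cat sat on the mat", "stock prices fell sharply today"]
store = index(docs)
top = retrieve("where did the cat sit", store, docs)
print(generate("where did the cat sit", top))
```

Swap the word-overlap scoring for embedding similarity and the template for an LLM call, and you have the full system built below.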
For AI agents in production, RAG is what turns a generic chatbot into something that actually knows your business. An agent with RAG can reference your knowledge base, pull up specific policy documents, and give answers rooted in real information — which is exactly the kind of persistent memory that makes agents useful in the real world.
The Architecture
Here's what we're building:
Documents → Chunking → Embedding → Vector Store
↓
User Query → Embedding ──→ Similarity Search
↓
Top-K Chunks + Query → LLM → Answer

Every piece is swappable. You can change the chunking strategy, the embedding model, the vector store, or the LLM independently. That modularity is the whole point.
Prerequisites
You'll need an OpenAI API key. The embeddings model we're using (text-embedding-3-small) costs $0.02 per million tokens — so running this tutorial will cost you a fraction of a cent. For generation, we'll use gpt-4o-mini.
Part 1: RAG in TypeScript
Let's build the TypeScript version first. Create a new project:
mkdir rag-from-scratch && cd rag-from-scratch
npm init -y
npm install openai

Here's the package.json you'll need:
{
"name": "rag-from-scratch",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "npx tsx src/rag.ts"
},
"dependencies": {
"openai": "^4.73.0"
},
"devDependencies": {
"tsx": "^4.19.0"
}
}

Step 1: Chunking
First, we need to split documents into chunks. Why? Because embedding models have token limits, and smaller chunks produce more precise retrieval. If you embed an entire 50-page document as one vector, the embedding is a blurry average of everything. If you embed individual paragraphs, each vector captures a specific idea.
There are three common chunking strategies:
Fixed-size chunking — Split every N characters with overlap. Simple and predictable, but cuts mid-sentence.
Sentence-based chunking — Split on sentence boundaries. Preserves meaning but produces uneven chunk sizes.
Recursive chunking — Try splitting on paragraphs first, then sentences, then words. Keeps semantic coherence while respecting size limits. This is what LangChain uses internally, and it's what we'll build.
// Recursive character text splitter
// Tries separators in order: paragraphs → sentences → words → characters
export interface Chunk {
text: string;
index: number;
metadata?: Record<string, unknown>;
}
export function chunkText(
text: string,
options: {
maxChunkSize?: number;
overlap?: number;
separators?: string[];
} = {}
): Chunk[] {
const {
maxChunkSize = 500,
overlap = 50,
separators = ["\n\n", "\n", ". ", " "],
} = options;
const chunks: Chunk[] = [];
function splitRecursive(text: string, separatorIndex: number): string[] {
if (text.length <= maxChunkSize) return [text];
if (separatorIndex >= separators.length) {
// Last resort: hard split
const parts: string[] = [];
for (let i = 0; i < text.length; i += maxChunkSize - overlap) {
parts.push(text.slice(i, i + maxChunkSize));
}
return parts;
}
const separator = separators[separatorIndex];
const parts = text.split(separator);
const merged: string[] = [];
let current = "";
for (const part of parts) {
const candidate = current ? current + separator + part : part;
if (candidate.length > maxChunkSize && current) {
merged.push(current);
current = part;
} else {
current = candidate;
}
}
if (current) merged.push(current);
// If any chunk is still too large, split it with the next separator
const result: string[] = [];
for (const chunk of merged) {
if (chunk.length > maxChunkSize) {
result.push(...splitRecursive(chunk, separatorIndex + 1));
} else {
result.push(chunk);
}
}
return result;
}
const rawChunks = splitRecursive(text, 0);
for (let i = 0; i < rawChunks.length; i++) {
const trimmed = rawChunks[i].trim();
if (trimmed.length > 0) {
chunks.push({ text: trimmed, index: chunks.length });
}
}
return chunks;
}

Step 2: Embeddings
Now we convert chunks into vectors. An embedding is just a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings have vectors that point in similar directions.
We'll use OpenAI's text-embedding-3-small, which produces 1536-dimensional vectors. It's fast, cheap, and good enough for most use cases.
import OpenAI from "openai";
const openai = new OpenAI(); // Uses OPENAI_API_KEY env var
export async function embedTexts(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
// Sort by index to maintain order
return response.data
.sort((a, b) => a.index - b.index)
.map((item) => item.embedding);
}
export async function embedQuery(query: string): Promise<number[]> {
const [embedding] = await embedTexts([query]);
return embedding;
}

Step 3: Vector Store (In-Memory)
A vector store is just a collection of vectors with a way to find the nearest neighbors to a query vector. We'll start with the simplest possible implementation: an in-memory store using cosine similarity.
Cosine similarity measures the angle between two vectors. A value of 1.0 means identical direction (identical meaning), 0.0 means the vectors are orthogonal (unrelated), and negative values indicate opposing meanings.
import { Chunk } from "./chunker.js";
export interface StoredDocument {
chunk: Chunk;
embedding: number[];
source: string;
}
export interface SearchResult {
chunk: Chunk;
score: number;
source: string;
}
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
// Guard against zero-length vectors, matching the Python version
const denom = Math.sqrt(normA) * Math.sqrt(normB);
return denom === 0 ? 0 : dotProduct / denom;
}
export class VectorStore {
private documents: StoredDocument[] = [];
add(chunks: Chunk[], embeddings: number[][], source: string): void {
for (let i = 0; i < chunks.length; i++) {
this.documents.push({
chunk: chunks[i],
embedding: embeddings[i],
source,
});
}
}
search(queryEmbedding: number[], topK: number = 3): SearchResult[] {
const scored = this.documents.map((doc) => ({
chunk: doc.chunk,
source: doc.source,
score: cosineSimilarity(queryEmbedding, doc.embedding),
}));
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, topK);
}
get size(): number {
return this.documents.length;
}
}

This brute-force approach checks every document on every query. It's O(n) per search, which is fine for hundreds or even thousands of documents. Once you hit tens of thousands, you'll want an approximate nearest neighbor (ANN) index — which is exactly what production vector databases provide.
Step 4: Generation
Now the fun part. We take the retrieved chunks, build a prompt, and ask the LLM to answer using only the provided context. This is where the "augmented" in Retrieval-Augmented Generation happens.
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
export interface GenerationResult {
answer: string;
sources: SearchResult[];
prompt: string;
}
export async function generate(
query: string,
results: SearchResult[],
options: { model?: string; temperature?: number } = {}
): Promise<GenerationResult> {
const { model = "gpt-4o-mini", temperature = 0.2 } = options;
const contextBlock = results
.map(
(r, i) =>
`[Source ${i + 1}] (score: ${r.score.toFixed(3)}, from: ${r.source})\n${r.chunk.text}`
)
.join("\n\n");
const systemPrompt = `You are a helpful assistant that answers questions based on the provided context documents.
Rules:
- Answer ONLY based on the provided context
- If the context doesn't contain enough information, say so
- Cite which source(s) you used with [Source N] notation
- Be concise and direct`;
const userPrompt = `Context documents:
${contextBlock}
Question: ${query}
Answer based on the context above:`;
const response = await openai.chat.completions.create({
model,
temperature,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: userPrompt },
],
});
return {
answer: response.choices[0].message.content ?? "",
sources: results,
prompt: userPrompt,
};
}

Notice the system prompt tells the model to only use the provided context. This is crucial — without it, the model will happily fill gaps with its training data, which defeats the purpose of RAG. This is a core principle of prompt engineering: explicit constraints produce more reliable behavior.
Step 5: Putting It All Together
import { chunkText } from "./chunker.js";
import { embedTexts, embedQuery } from "./embeddings.js";
import { VectorStore } from "./vector-store.js";
import { generate } from "./generator.js";
// Sample documents — imagine these come from your knowledge base
const documents = [
{
source: "product-overview.md",
content: `Chanl is an AI agent platform for building, connecting, and monitoring
customer experience agents. It supports voice and text channels. Agents can be
configured with custom prompts, knowledge bases, and tool integrations.
The platform provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement.
Agents connect to external systems through MCP (Model Context Protocol)
integrations. MCP allows agents to call APIs, query databases, and trigger
workflows in third-party tools without custom code.`,
},
{
source: "pricing-faq.md",
content: `Chanl offers three pricing tiers: Lite, Startup, and Business.
The Lite plan includes up to 5 agents and 1,000 interactions per month.
It costs $49/month and is designed for small teams getting started.
The Startup plan includes up to 25 agents and 10,000 interactions per month.
It costs $199/month and includes advanced analytics and priority support.
The Business plan includes unlimited agents and interactions.
Pricing is custom and includes dedicated support, SLAs, and SSO.`,
},
{
source: "memory-system.md",
content: `The memory system allows agents to remember information across conversations.
Short-term memory persists within a single conversation session.
Long-term memory stores facts about customers across multiple conversations.
Memory entries are automatically extracted from conversations and stored
as key-value pairs. For example, if a customer mentions they prefer email
communication, the agent stores this preference and uses it in future
interactions.
Memory can be managed through the API or the admin dashboard. Entries can
be viewed, edited, or deleted. Memory is scoped per customer per agent.`,
},
];
async function main() {
console.log("=== RAG Pipeline Demo ===\n");
// Step 1: Index documents
console.log("Indexing documents...");
const store = new VectorStore();
for (const doc of documents) {
const chunks = chunkText(doc.content, { maxChunkSize: 300, overlap: 30 });
const embeddings = await embedTexts(chunks.map((c) => c.text));
store.add(chunks, embeddings, doc.source);
console.log(` ${doc.source}: ${chunks.length} chunks`);
}
console.log(`\nTotal chunks in store: ${store.size}\n`);
// Step 2: Query
const queries = [
"What analytics features does Chanl provide?",
"How much does the Startup plan cost?",
"How does the memory system work?",
"Does Chanl support Salesforce integration?",
];
for (const query of queries) {
console.log(`Q: ${query}`);
// Retrieve
const queryEmbedding = await embedQuery(query);
const results = store.search(queryEmbedding, 3);
console.log(` Retrieved ${results.length} chunks:`);
for (const r of results) {
console.log(
` - [${r.source}] score: ${r.score.toFixed(3)} | "${r.chunk.text.slice(0, 60)}..."`
);
}
// Generate
const { answer } = await generate(query, results);
console.log(`\nA: ${answer}\n`);
console.log("---\n");
}
}
main().catch(console.error);

Run it:
export OPENAI_API_KEY="sk-your-key-here"
npx tsx src/rag.ts

You should see the pipeline index your documents, retrieve relevant chunks for each query, and generate grounded answers. The last query — about Salesforce — is intentionally unanswerable from the provided documents. A well-configured RAG pipeline should say it doesn't have enough information rather than hallucinate.
Part 2: RAG in Python
Same pipeline, same logic, in Python. If you're more comfortable on this side of the fence, start here.
mkdir rag-from-scratch && cd rag-from-scratch
python -m venv venv && source venv/bin/activate
pip install openai numpy

Here's the requirements.txt:
openai>=1.55.0
numpy>=1.26.0

The Complete Python Implementation
"""
RAG from Scratch — a complete retrieval-augmented generation pipeline.
No frameworks, no magic. Just embeddings, vector search, and generation.
"""
from dataclasses import dataclass, field
from openai import OpenAI
import numpy as np
client = OpenAI() # Uses OPENAI_API_KEY env var
# -- Chunking --------------------------------------------------------
@dataclass
class Chunk:
text: str
index: int
metadata: dict = field(default_factory=dict)
def chunk_text(
text: str,
max_chunk_size: int = 500,
overlap: int = 50,
separators: list[str] | None = None,
) -> list[Chunk]:
"""Recursive character text splitter."""
if separators is None:
separators = ["\n\n", "\n", ". ", " "]
def split_recursive(text: str, sep_idx: int) -> list[str]:
if len(text) <= max_chunk_size:
return [text]
if sep_idx >= len(separators):
# Hard split as last resort
return [
text[i : i + max_chunk_size]
for i in range(0, len(text), max_chunk_size - overlap)
]
parts = text.split(separators[sep_idx])
merged: list[str] = []
current = ""
for part in parts:
candidate = current + separators[sep_idx] + part if current else part
if len(candidate) > max_chunk_size and current:
merged.append(current)
current = part
else:
current = candidate
if current:
merged.append(current)
result: list[str] = []
for chunk in merged:
if len(chunk) > max_chunk_size:
result.extend(split_recursive(chunk, sep_idx + 1))
else:
result.append(chunk)
return result
raw_chunks = split_recursive(text, 0)
chunks: list[Chunk] = []
for raw in raw_chunks:
trimmed = raw.strip()
if trimmed:
chunks.append(Chunk(text=trimmed, index=len(chunks)))
return chunks
# -- Embeddings ------------------------------------------------------
def embed_texts(texts: list[str]) -> list[list[float]]:
"""Embed a batch of texts using OpenAI's text-embedding-3-small."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
sorted_data = sorted(response.data, key=lambda x: x.index)
return [item.embedding for item in sorted_data]
def embed_query(query: str) -> list[float]:
"""Embed a single query string."""
return embed_texts([query])[0]
# -- Vector Store ----------------------------------------------------
@dataclass
class StoredDocument:
chunk: Chunk
embedding: np.ndarray
source: str
@dataclass
class SearchResult:
chunk: Chunk
score: float
source: str
class VectorStore:
def __init__(self):
self.documents: list[StoredDocument] = []
def add(self, chunks: list[Chunk], embeddings: list[list[float]], source: str):
for chunk, emb in zip(chunks, embeddings):
self.documents.append(
StoredDocument(chunk=chunk, embedding=np.array(emb), source=source)
)
def search(self, query_embedding: list[float], top_k: int = 3) -> list[SearchResult]:
query_vec = np.array(query_embedding)
query_norm = np.linalg.norm(query_vec)
scored: list[SearchResult] = []
for doc in self.documents:
dot = np.dot(doc.embedding, query_vec)
doc_norm = np.linalg.norm(doc.embedding)
similarity = dot / (doc_norm * query_norm) if (doc_norm * query_norm) > 0 else 0.0
scored.append(SearchResult(chunk=doc.chunk, score=float(similarity), source=doc.source))
scored.sort(key=lambda x: x.score, reverse=True)
return scored[:top_k]
@property
def size(self) -> int:
return len(self.documents)
# -- Generation ------------------------------------------------------
def generate(
query: str,
results: list[SearchResult],
model: str = "gpt-4o-mini",
temperature: float = 0.2,
) -> dict:
"""Generate an answer using retrieved context."""
context_block = "\n\n".join(
f"[Source {i+1}] (score: {r.score:.3f}, from: {r.source})\n{r.chunk.text}"
for i, r in enumerate(results)
)
system_prompt = """You are a helpful assistant that answers questions based on the provided context documents.
Rules:
- Answer ONLY based on the provided context
- If the context doesn't contain enough information, say so
- Cite which source(s) you used with [Source N] notation
- Be concise and direct"""
user_prompt = f"""Context documents:
{context_block}
Question: {query}
Answer based on the context above:"""
response = client.chat.completions.create(
model=model,
temperature=temperature,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return {
"answer": response.choices[0].message.content or "",
"sources": results,
"prompt": user_prompt,
}
# -- Main ------------------------------------------------------------
DOCUMENTS = [
{
"source": "product-overview.md",
"content": """Chanl is an AI agent platform for building, connecting, and monitoring
customer experience agents. It supports voice and text channels. Agents can be
configured with custom prompts, knowledge bases, and tool integrations.
The platform provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement.
Agents connect to external systems through MCP (Model Context Protocol)
integrations. MCP allows agents to call APIs, query databases, and trigger
workflows in third-party tools without custom code.""",
},
{
"source": "pricing-faq.md",
"content": """Chanl offers three pricing tiers: Lite, Startup, and Business.
The Lite plan includes up to 5 agents and 1,000 interactions per month.
It costs $49/month and is designed for small teams getting started.
The Startup plan includes up to 25 agents and 10,000 interactions per month.
It costs $199/month and includes advanced analytics and priority support.
The Business plan includes unlimited agents and interactions.
Pricing is custom and includes dedicated support, SLAs, and SSO.""",
},
{
"source": "memory-system.md",
"content": """The memory system allows agents to remember information across conversations.
Short-term memory persists within a single conversation session.
Long-term memory stores facts about customers across multiple conversations.
Memory entries are automatically extracted from conversations and stored
as key-value pairs. For example, if a customer mentions they prefer email
communication, the agent stores this preference and uses it in future
interactions.
Memory can be managed through the API or the admin dashboard. Entries can
be viewed, edited, or deleted. Memory is scoped per customer per agent.""",
},
]
def main():
print("=== RAG Pipeline Demo (Python) ===\n")
# Step 1: Index
print("Indexing documents...")
store = VectorStore()
for doc in DOCUMENTS:
chunks = chunk_text(doc["content"], max_chunk_size=300, overlap=30)
embeddings = embed_texts([c.text for c in chunks])
store.add(chunks, embeddings, doc["source"])
print(f" {doc['source']}: {len(chunks)} chunks")
print(f"\nTotal chunks in store: {store.size}\n")
# Step 2: Query
queries = [
"What analytics features does Chanl provide?",
"How much does the Startup plan cost?",
"How does the memory system work?",
"Does Chanl support Salesforce integration?",
]
for query in queries:
print(f"Q: {query}")
query_embedding = embed_query(query)
results = store.search(query_embedding, top_k=3)
print(f" Retrieved {len(results)} chunks:")
for r in results:
preview = r.chunk.text[:60]
print(f' - [{r.source}] score: {r.score:.3f} | "{preview}..."')
result = generate(query, results)
print(f"\nA: {result['answer']}\n")
print("---\n")
if __name__ == "__main__":
main()

Run it:
export OPENAI_API_KEY="sk-your-key-here"
python rag.py

Both implementations produce the same behavior. The Python version uses numpy for vector math, which is slightly more efficient than the manual loop in TypeScript, but for a few hundred vectors the difference is negligible.
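If you want numpy's advantage to actually show up, stack the stored embeddings into one matrix and score every document in a single matrix-vector product instead of a Python loop. A sketch (assumption: the matrix would be rebuilt or appended to whenever add() runs):

```python
import numpy as np

def batch_cosine_scores(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query against every row of `matrix` at once.

    matrix: (n_docs, dim) array of stored embeddings
    query:  (dim,) query embedding
    """
    doc_norms = np.linalg.norm(matrix, axis=1)  # (n_docs,)
    denom = doc_norms * np.linalg.norm(query)
    denom[denom == 0] = 1.0                     # guard against zero vectors
    return (matrix @ query) / denom             # (n_docs,) similarity scores

# Example with tiny fake embeddings
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
scores = batch_cosine_scores(matrix, query)
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 best matches
```

This is still brute force, but it pushes the inner loop into optimized C — usually a large constant-factor speedup over per-document Python iteration.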
Chunking Strategies: Which One to Pick
The chunking strategy you choose has a bigger impact on retrieval quality than most people expect. Here's a quick comparison:
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size (every N chars) | Simple, predictable | Cuts mid-sentence, breaks meaning | Structured data, logs |
| Sentence-based (split on .) | Preserves sentence meaning | Uneven sizes, some chunks too small | Clean prose, FAQs |
| Recursive (paragraph → sentence → word) | Best semantic coherence | More complex to implement | General-purpose (recommended) |
| Semantic (split when meaning shifts) | Most precise boundaries | Requires embedding each sentence first | High-quality knowledge bases |
The recursive splitter we built is the right default for most use cases. It tries to keep paragraphs together, falls back to sentences, then words. The overlap parameter ensures context at chunk boundaries isn't completely lost. Note that in our implementation, overlap only applies at the last-resort hard split; the merge phase keeps whole paragraphs and sentences intact.
One thing to watch: chunk size directly affects retrieval precision. Smaller chunks (200-300 tokens) are more precise but miss surrounding context. Larger chunks (500-1000 tokens) capture more context but dilute the signal. For most RAG pipelines, 300-500 tokens is the sweet spot. Our splitter measures characters rather than tokens; English prose averages roughly four characters per token, so scale maxChunkSize accordingly.
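To make overlap concrete, here's the last-resort fixed-size split in isolation. Each chunk repeats the tail of the previous one, so text straddling a boundary still appears whole somewhere (sizes are deliberately tiny for the demo):

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    # Step forward by (size - overlap) so consecutive chunks share `overlap` chars
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

print(fixed_size_chunks("abcdefghij", size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With overlap=0 the boundary between 'd' and 'e' would sever any phrase crossing it; with overlap=2, every two-character window survives intact in at least one chunk.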
Embeddings: Models and Trade-offs
We used OpenAI's text-embedding-3-small because it's the easiest to get started with. Here's how it compares to alternatives:
| Model | Dimensions | Cost | Quality | Speed |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Good | Fast |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Better | Fast |
| Voyage AI voyage-3 | 1024 | $0.06/1M tokens | Excellent for code | Fast |
| Nomic Embed (local) | 768 | Free (self-hosted) | Good | Depends on hardware |
| BGE-M3 (local) | 1024 | Free (self-hosted) | Good multilingual | Depends on hardware |
For production, the choice depends on your constraints. If you're already using OpenAI, text-embedding-3-small keeps things simple. If you need the best possible retrieval quality and don't mind the cost, text-embedding-3-large is a meaningful step up. If you can't send data to external APIs, Nomic or BGE-M3 run locally via Ollama.
One important note: you must use the same embedding model for indexing and querying. Vectors from different models live in different vector spaces and can't be compared meaningfully.
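A cheap way to enforce this is to record the model name alongside the index and refuse mismatched vectors. A minimal sketch (the class and attribute names here are our own, not part of any library):

```python
class EmbeddingIndex:
    """Wraps stored vectors and remembers which embedding model produced them."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.vectors: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str, model_name: str) -> None:
        # Reject vectors from a different model: they live in a different space
        if model_name != self.model_name:
            raise ValueError(
                f"Index built with {self.model_name}, got vector from {model_name}"
            )
        self.vectors.append((embedding, text))

index = EmbeddingIndex("text-embedding-3-small")
index.add([0.1, 0.2], "some chunk", "text-embedding-3-small")   # fine
# index.add([0.1], "bad", "text-embedding-3-large")             # would raise ValueError
```

Production vector databases typically let you store this as collection metadata; the principle is the same.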
Scaling Up: Production Vector Stores
The in-memory vector store we built works for demos and small datasets. For production, you'll want a dedicated vector database. Here's the landscape:
Pinecone — Fully managed, serverless, sub-50ms latency even at billion-scale. Best for teams that don't want to manage infrastructure. Free tier available.
Chroma — Open source, Python-native, minimal setup. Great for prototyping and small to medium datasets. Can run embedded in your process or as a separate server.
pgvector — PostgreSQL extension. If you already run Postgres, this is the lowest-friction option. Competitive performance up to ~100M vectors with the pgvectorscale extension.
Weaviate — Open source with strong hybrid search (combining vector + keyword). Available as managed cloud or self-hosted.
Qdrant — Open source, Rust-based, excellent performance. Best free tier among dedicated vector databases.
Swapping from our in-memory store to any of these is straightforward — you're replacing the add() and search() methods. The chunking, embedding, and generation layers stay exactly the same. That modularity is why building from scratch first is valuable: you understand which piece does what, so upgrading one component doesn't require rethinking the whole system.
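One way to keep that swap painless is to pin down the store interface explicitly. A sketch using Python's typing.Protocol (the protocol name is ours): any backend with these two methods — in-memory, Chroma-backed, pgvector-backed — drops in without touching the rest of the pipeline:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class VectorStoreProtocol(Protocol):
    """The minimal contract the rest of the pipeline relies on."""

    def add(self, chunks: list, embeddings: list[list[float]], source: str) -> None: ...
    def search(self, query_embedding: list[float], top_k: int = 3) -> list: ...

class InMemoryStore:
    """Same shape as the store built above; a class wrapping a real vector DB
    with these two methods would satisfy the protocol just as well."""

    def __init__(self):
        self.docs = []

    def add(self, chunks, embeddings, source):
        self.docs.extend(zip(chunks, embeddings, [source] * len(chunks)))

    def search(self, query_embedding, top_k=3):
        return self.docs[:top_k]  # placeholder ranking for the sketch

store: VectorStoreProtocol = InMemoryStore()
store.add(["chunk a"], [[0.1, 0.2]], "doc.md")
```

The chunking, embedding, and generation code only ever sees the protocol, so the backend becomes a deployment decision rather than an architectural one.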
When your agents start connecting to external systems via MCP integrations and tool calls, the RAG pipeline becomes just one of several information sources. The vector store might handle product docs while an MCP tool queries live inventory data. Understanding each piece independently makes that composition straightforward.
Evaluating Your RAG Pipeline
A RAG pipeline that returns wrong answers confidently is worse than no RAG at all. You need to evaluate three things:
1. Retrieval Quality
Did the retriever find the right chunks? The simplest check: look at the similarity scores and the retrieved text. If the top chunk isn't relevant to the question, your retrieval is broken — no amount of generation quality will fix it.
2. Faithfulness
Does the generated answer actually use the retrieved context, or is the model ignoring it and hallucinating? This is the most critical evaluation.
3. Answer Quality
Is the answer correct, complete, and helpful? This is the end-to-end metric.
Here's a simple evaluation function you can add to either implementation:
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
interface EvalResult {
relevanceScore: number;
faithfulnessScore: number;
qualityScore: number;
reasoning: string;
}
export async function evaluateResponse(
query: string,
answer: string,
retrievedChunks: SearchResult[],
referenceAnswer?: string
): Promise<EvalResult> {
const context = retrievedChunks.map((r) => r.chunk.text).join("\n\n");
const evalPrompt = `You are an evaluation judge for a RAG system. Score the following on a scale of 1-5.
Query: ${query}
Retrieved Context:
${context}
Generated Answer:
${answer}
${referenceAnswer ? `\nReference Answer: ${referenceAnswer}` : ""}
Score these three dimensions (1-5 each):
1. RELEVANCE: Are the retrieved chunks relevant to the query?
2. FAITHFULNESS: Does the answer only use information from the retrieved context? (5 = fully grounded, 1 = hallucinated)
3. QUALITY: Is the answer correct, complete, and helpful?
Respond in JSON format:
{"relevance": N, "faithfulness": N, "quality": N, "reasoning": "brief explanation"}`;
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0,
messages: [{ role: "user", content: evalPrompt }],
response_format: { type: "json_object" },
});
const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
return {
relevanceScore: parsed.relevance ?? 0,
faithfulnessScore: parsed.faithfulness ?? 0,
qualityScore: parsed.quality ?? 0,
reasoning: parsed.reasoning ?? "",
};
}
// Usage: add this to your main() function
// const evalResult = await evaluateResponse(query, answer, results);
// console.log(` Eval: R=${evalResult.relevanceScore} F=${evalResult.faithfulnessScore} Q=${evalResult.qualityScore}`);
// console.log(`  Reasoning: ${evalResult.reasoning}`);

And the Python equivalent:
"""RAG evaluation: relevance, faithfulness, and answer quality."""
import json
from openai import OpenAI
client = OpenAI()
def evaluate_response(
query: str,
answer: str,
retrieved_chunks: list,
reference_answer: str | None = None,
) -> dict:
context = "\n\n".join(r.chunk.text for r in retrieved_chunks)
ref_section = f"\nReference Answer: {reference_answer}" if reference_answer else ""
eval_prompt = f"""You are an evaluation judge for a RAG system. Score the following on a scale of 1-5.
Query: {query}
Retrieved Context:
{context}
Generated Answer:
{answer}
{ref_section}
Score these three dimensions (1-5 each):
1. RELEVANCE: Are the retrieved chunks relevant to the query?
2. FAITHFULNESS: Does the answer only use information from the retrieved context? (5 = fully grounded, 1 = hallucinated)
3. QUALITY: Is the answer correct, complete, and helpful?
Respond in JSON format:
{{"relevance": N, "faithfulness": N, "quality": N, "reasoning": "brief explanation"}}"""
response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[{"role": "user", "content": eval_prompt}],
response_format={"type": "json_object"},
)
parsed = json.loads(response.choices[0].message.content or "{}")
return {
"relevance": parsed.get("relevance", 0),
"faithfulness": parsed.get("faithfulness", 0),
"quality": parsed.get("quality", 0),
"reasoning": parsed.get("reasoning", ""),
}
# Usage: add to your main() function
# eval_result = evaluate_response(query, result["answer"], results)
# print(f" Eval: R={eval_result['relevance']} F={eval_result['faithfulness']} Q={eval_result['quality']}")
# print(f"  Reasoning: {eval_result['reasoning']}")

This uses LLM-as-judge evaluation — the same approach used by RAG evaluation frameworks like RAGAS and DeepEval. It's not perfect (the judge can be wrong), but it's the fastest way to get automated quality signals. For production systems, you'd combine this with human evaluation and analytics dashboards that track quality metrics over time.
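Once per-query scores exist, tracking quality over time is just aggregation. A sketch that averages each dimension across an eval run and flags likely hallucinations (the 4.0 faithfulness floor is an arbitrary threshold, not a standard):

```python
def summarize_evals(evals: list[dict], faithfulness_floor: float = 4.0) -> dict:
    """Average each 1-5 score across an eval run and flag low-faithfulness answers."""
    n = len(evals)
    averages = {
        dim: sum(e[dim] for e in evals) / n
        for dim in ("relevance", "faithfulness", "quality")
    }
    flagged = [i for i, e in enumerate(evals) if e["faithfulness"] < faithfulness_floor]
    return {"averages": averages, "flagged_indices": flagged, "n": n}

# Example with fake per-query eval results
run = [
    {"relevance": 5, "faithfulness": 5, "quality": 4},
    {"relevance": 4, "faithfulness": 2, "quality": 3},  # likely hallucination
]
summary = summarize_evals(run)
print(summary["averages"]["faithfulness"])  # 3.5
print(summary["flagged_indices"])           # [1]
```

Run this over a fixed query set after every pipeline change and you have a crude but useful regression test for retrieval quality.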
Common RAG Failure Modes (And How to Fix Them)
Once you have a working pipeline, here's what typically goes wrong:
The retriever finds irrelevant chunks. Your chunks are too large, or your embeddings don't capture the right semantics. Fix: smaller chunks, try a better embedding model, or add metadata filtering (e.g., only search within a specific document category).
The model ignores the context. This usually means your prompt isn't constraining the model enough, or the context is so long that the model "loses" the relevant information in the middle. Fix: tighten your system prompt, reduce the number of retrieved chunks, or put the most relevant chunk last (models attend more to the beginning and end of context).
Answers are correct but miss important details. Your top-K is too low, or the relevant information is spread across chunks that don't get retrieved together. Fix: increase top-K, use a reranker to promote better chunks, or try larger chunks with more overlap.
Performance is slow. Embedding the query + searching + generating takes too long. Fix: cache frequently-asked-question embeddings, use a vector database with ANN indexing instead of brute-force, and consider a smaller/faster generation model.
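The cheapest of those fixes is the embedding cache: identical queries skip the API round trip entirely. A sketch (embed_fn stands for your real embedding function; normalizing the cache key is our own choice to catch trivial variants of the same question):

```python
_embedding_cache: dict[str, list[float]] = {}

def embed_query_cached(query: str, embed_fn) -> list[float]:
    """Memoize query embeddings so repeated questions skip the API call."""
    key = query.strip().lower()  # treat casing/whitespace variants as the same query
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)
    return _embedding_cache[key]

# Demo with a stub embedder that counts how often it's actually invoked
calls = {"n": 0}
def fake_embed(q: str) -> list[float]:
    calls["n"] += 1
    return [0.0, 1.0]

embed_query_cached("How much is the Startup plan?", fake_embed)
embed_query_cached("how much is the startup plan?  ", fake_embed)  # cache hit
print(calls["n"])  # 1
```

In production you'd bound the cache size (e.g. an LRU) and invalidate it if you ever change embedding models, per the consistency rule above.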
What's Next
You now have a working RAG pipeline that you understand end to end. Every piece is visible, every decision is yours. From here, the natural progressions are:
- Swap the vector store for Chroma or pgvector and test with a larger document corpus
- Add a reranker — retrieve top-20, rerank with a cross-encoder, keep top-3
- Implement hybrid search — combine vector similarity with keyword matching (BM25)
- Add metadata filtering — tag chunks with source, date, category, and filter during retrieval
- Try query expansion — rewrite the user's query into multiple forms before searching
RAG is the foundation that makes AI agents actually useful with your data. Whether you're building a customer support agent, an internal knowledge assistant, or any system that needs to answer questions from a specific corpus, the pattern is the same: chunk, embed, retrieve, generate. Everything else is optimization.
Chanl Team
AI Agent Testing Platform
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.