Ask an LLM about your company's return policy and it'll confidently make one up. The model doesn't know your docs exist — it's generating from training data, not your data.
Retrieval-Augmented Generation (RAG) fixes this. Instead of hoping the model memorized the right information during training, you fetch the relevant documents first and hand them to the model as context. The model generates an answer grounded in your actual data. No fine-tuning, no retraining, no waiting weeks for a model update.
Here you'll build a complete RAG pipeline from scratch in TypeScript: chunking, embeddings, vector search, and generation. No framework abstractions hiding the moving parts — just the raw components wired together so you understand every piece.
| Pipeline Stage | What it does | Key decision |
|---|---|---|
| Chunking | Split documents into searchable pieces | Recursive splitting at 300–500 tokens (best default) |
| Embedding | Convert text chunks into vector representations | text-embedding-3-small for cost; 3-large for quality |
| Vector store | Store and search embeddings by similarity | In-memory for prototyping; Pinecone/pgvector for production |
| Retrieval | Find top-K chunks closest to the user's query | Top 3–5 chunks balances precision and context coverage |
| Generation | LLM answers using only the retrieved context | Constrain the prompt to prevent hallucination beyond context |
| Evaluation | Score relevance, faithfulness, and answer quality | LLM-as-judge with structured rubric |
What RAG actually does
Retrieval-Augmented Generation works in three stages: index your documents as vector embeddings, retrieve the most relevant chunks for a given query using similarity search, then generate an answer by feeding those chunks as context to an LLM. Everything else is optimization on top of these three.
1. Indexing — Take your documents, split them into chunks, convert each chunk into a vector embedding, and store those vectors somewhere searchable.
2. Retrieval — When a user asks a question, convert that question into a vector embedding too, then find the document chunks whose vectors are closest to the question vector.
3. Generation — Take the retrieved chunks, stuff them into a prompt alongside the user's question, and send the whole thing to an LLM. The model generates an answer grounded in those specific documents.
That's the whole pattern. The reason RAG works so well is that it separates knowing where to look (retrieval) from knowing how to answer (generation). The retrieval system handles relevance. The LLM handles synthesis and language. Each does what it's best at.
Why not just fine-tune?
Fine-tuning bakes knowledge directly into the model's weights. That sounds appealing until you consider the tradeoffs:
- Staleness. Fine-tuned knowledge is frozen at training time. When your docs change, you retrain. With RAG, you just re-index the changed documents.
- Cost. Fine-tuning GPT-4o costs ~$25 per million training tokens, takes hours, and you pay again every time your data changes. RAG embedding costs pennies per million tokens and takes seconds.
- Traceability. With RAG, you can show exactly which documents produced an answer. With fine-tuning, the model's reasoning is opaque — you can't point to a source.
- Hallucination control. Fine-tuned models still hallucinate. RAG gives you a concrete mechanism to constrain answers to retrieved context.
Fine-tuning is useful for teaching a model a new style or behavior (e.g., always respond in a specific format). RAG is for giving a model access to specific knowledge. Most teams need RAG, not fine-tuning. Some need both.
What about long context windows?
GPT-4o supports 128K tokens of context. Claude supports 200K. Can't you just dump all your docs into the prompt and skip retrieval entirely?
You can, and for small document sets it works. But there are three problems:
- Cost scales linearly. Every query pays for the full context window. Sending 100K tokens per query at $2.50/1M input tokens means $0.25 per question. RAG sends only the relevant 1-2K tokens, cutting cost by 50-100x.
- Latency increases. More input tokens = slower responses. A 100K token prompt takes noticeably longer than a 2K token prompt with three targeted chunks.
- Accuracy degrades. Research consistently shows that LLMs perform worse at finding relevant information in the middle of very long contexts. RAG pre-filters to just the relevant chunks, so the model doesn't have to search.
The practical rule: if your entire knowledge base fits in 10-20K tokens and doesn't change often, just stuff it in the prompt. Beyond that, RAG is more cost-effective, faster, and more accurate.
There's also a hybrid approach: use RAG to retrieve relevant chunks, but include a broader "context summary" in every prompt (a 500-token overview of your product or domain). This gives the model general awareness while RAG provides specific details. Think of it as the model knowing the table of contents of your knowledge base, while RAG retrieves the specific pages.
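A minimal sketch of that hybrid prompt construction — the `buildHybridPrompt` helper and the domain summary text are invented for illustration:

```typescript
// Hypothetical sketch: every prompt carries a fixed domain summary,
// while RAG supplies the query-specific chunks.
const DOMAIN_SUMMARY =
  "Acme Support: we sell project-management software with Lite, Pro, and Enterprise plans.";

function buildHybridPrompt(retrievedChunks: string[], question: string): string {
  const context = retrievedChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n");
  return [
    `Background (always included):\n${DOMAIN_SUMMARY}`,
    `Retrieved context:\n${context}`,
    `Question: ${question}`,
  ].join("\n\n");
}
```

The summary plays the role of the table of contents; the retrieved chunks are the specific pages.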
RAG for production AI agents
For AI agents in production, RAG is what turns a generic chatbot into something that actually knows your business. An agent with RAG can reference your knowledge base, pull up specific policy documents, and give answers rooted in real information — which is exactly the kind of persistent memory that makes agents useful in the real world.
Without RAG, an agent answering "What's our refund policy for enterprise customers?" has to guess based on its training data. With RAG, it retrieves your actual refund policy document and quotes the relevant section. The difference between those two experiences is the difference between a toy demo and a production tool.
The RAG architecture
A RAG pipeline flows in two directions: documents go through chunking, embedding, and storage at index time, while user queries go through embedding, similarity search against stored vectors, and LLM generation at query time.
Every piece is swappable. You can change the chunking strategy, the embedding model, the vector store, or the LLM independently. That modularity is the whole point — and it's why building from scratch first is so valuable. When you use a framework like LangChain or LlamaIndex, these pieces are hidden behind abstractions. Building them yourself means you understand exactly where to look when something breaks.
Prerequisites
You'll need an OpenAI API key. The embeddings model we're using (text-embedding-3-small) costs $0.02 per million tokens — running this tutorial costs a fraction of a cent. For generation, we'll use gpt-4o-mini.
Building the pipeline
We'll build four modules — chunker, embedder, vector store, and generator — then wire them together in a main file. Each module is independent and has a single responsibility, which makes it easy to swap components later.
The file structure looks like this:
rag-from-scratch/
src/
chunker.ts # Split documents into chunks
embeddings.ts # Convert text to vectors via OpenAI
vector-store.ts # In-memory store with cosine similarity
generator.ts # Prompt construction and LLM generation
rag.ts # Main pipeline: index + query
package.json
Create a new project:
mkdir rag-from-scratch && cd rag-from-scratch
npm init -y
npm install openai
Here's the package.json you'll need:
{
"name": "rag-from-scratch",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "npx tsx src/rag.ts"
},
"dependencies": {
"openai": "^4.73.0"
},
"devDependencies": {
"tsx": "^4.19.0"
}
}
Step 1: Chunking
First, we need to split documents into chunks. Why? Because embedding models have token limits, and smaller chunks produce more precise retrieval. If you embed an entire 50-page document as one vector, the embedding is a blurry average of everything in that document. If you embed individual paragraphs, each vector captures a specific idea — and retrieval can pinpoint exactly the right paragraph.
There are three common chunking strategies:
Fixed-size chunking — Split every N characters with overlap. Simple and predictable, but cuts mid-sentence. Works well for structured data like logs or CSVs where sentence boundaries don't matter much.
Sentence-based chunking — Split on sentence boundaries. Preserves meaning but produces uneven chunk sizes — some chunks end up with a single short sentence, others with a long paragraph.
Recursive chunking — Try splitting on paragraphs first, then sentences, then words. Keeps semantic coherence while respecting size limits. This is what LangChain uses internally, and it's what we'll build.
The recursive approach works by trying the largest separator first (double newline for paragraphs). If a resulting segment is still too large, it falls back to the next separator (single newline, then sentence boundaries, then spaces). This preserves the natural structure of your documents as much as possible.
// Recursive character text splitter
// Tries separators in order: paragraphs → sentences → words → characters
export interface Chunk {
text: string;
index: number;
metadata?: Record<string, unknown>;
}
export function chunkText(
text: string,
options: {
maxChunkSize?: number;
overlap?: number;
separators?: string[];
} = {}
): Chunk[] {
const {
maxChunkSize = 500,
overlap = 50,
separators = ["\n\n", "\n", ". ", " "],
} = options;
const chunks: Chunk[] = [];
function splitRecursive(text: string, separatorIndex: number): string[] {
if (text.length <= maxChunkSize) return [text];
if (separatorIndex >= separators.length) {
// Last resort: hard split
const parts: string[] = [];
for (let i = 0; i < text.length; i += maxChunkSize - overlap) {
parts.push(text.slice(i, i + maxChunkSize));
}
return parts;
}
const separator = separators[separatorIndex];
const parts = text.split(separator);
const merged: string[] = [];
let current = "";
for (const part of parts) {
const candidate = current ? current + separator + part : part;
if (candidate.length > maxChunkSize && current) {
merged.push(current);
current = part;
} else {
current = candidate;
}
}
if (current) merged.push(current);
// If any chunk is still too large, split it with the next separator
const result: string[] = [];
for (const chunk of merged) {
if (chunk.length > maxChunkSize) {
result.push(...splitRecursive(chunk, separatorIndex + 1));
} else {
result.push(chunk);
}
}
return result;
}
const rawChunks = splitRecursive(text, 0);
for (let i = 0; i < rawChunks.length; i++) {
const trimmed = rawChunks[i].trim();
if (trimmed.length > 0) {
chunks.push({ text: trimmed, index: chunks.length });
}
}
return chunks;
}
The metadata field on each chunk is empty here, but it becomes important in production. You'd attach the source document name, section heading, page number, creation date — anything that helps you filter or rank results later. When a user asks about pricing, metadata filters can restrict the search to documents tagged "pricing" before the vector comparison even runs. Here's what production metadata typically looks like:
chunks.push({
text: trimmed,
index: chunks.length,
metadata: {
source: "pricing-faq.md",
section: "Enterprise Plan",
lastUpdated: "2026-02-15",
category: "pricing",
accessLevel: "public",
},
});
This lets you do things like "only search documents updated in the last 6 months" or "only search documents the current user has access to" — critical for production systems.
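A sketch of what such metadata pre-filtering could look like. The `filteredSearch` helper and `DocEntry` type are hypothetical, and the 2-D vectors are toys for illustration — the point is that the filter predicate runs before any similarity math:

```typescript
// Hypothetical sketch: apply a metadata filter, then rank by cosine similarity.
interface DocEntry {
  text: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

function filteredSearch(
  docs: DocEntry[],
  queryEmbedding: number[],
  filter: (m: Record<string, unknown>) => boolean,
  topK = 3
): { text: string; score: number }[] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return docs
    .filter((d) => filter(d.metadata)) // metadata filter runs first
    .map((d) => ({ text: d.text, score: cosine(queryEmbedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

In a real store the filter would run against an index rather than an array scan, but the ordering is the same: narrow by metadata, then compare vectors.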
The overlap parameter also deserves a note. When you split text into chunks, you lose context at the boundaries. A fact that spans two paragraphs might get cut in half. Overlap mitigates this by repeating the last N characters of each chunk at the start of the next one. Fifty characters of overlap is a reasonable default — enough to preserve boundary context without inflating your chunk count too much. Note that the splitter we built only applies overlap in its hard-split fallback; a production splitter would typically apply it at every chunk boundary.
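To see the mechanism concretely, here's the hard-split-with-overlap logic in isolation (the same stepping used in the chunker's last-resort branch), applied to a toy string:

```typescript
// Fixed-size split with overlap: advance by (size - overlap) each step,
// so the last `overlap` characters of one chunk reappear in the next.
function hardSplit(text: string, size: number, overlap: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    parts.push(text.slice(i, i + size));
  }
  return parts;
}

const demo = hardSplit("abcdefghij", 6, 2);
// demo[0] is "abcdef"; demo[1] begins with "ef" — the 2-character overlap
```

If a fact sits on the "ef" boundary, both chunks now contain it, so either can be retrieved.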
Step 2: Embeddings
Now we convert chunks into vectors. An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings have vectors that point in similar directions — "return policy" and "refund guidelines" would produce vectors that are close together, even though they share no exact words.
This is what makes RAG fundamentally different from keyword search. Traditional search requires exact word matches. Embedding-based search understands meaning. A user asking "Can I get my money back?" will match a document about "refund policies" because the embeddings capture semantic similarity, not lexical overlap.
We'll use OpenAI's text-embedding-3-small, which produces 1536-dimensional vectors. Each dimension captures some aspect of the text's meaning — the model learned these dimensions during training on billions of text pairs. You can think of each dimension as a slider on a mixing board, and the full 1536-dimension vector as a unique "fingerprint" of the text's meaning.
import OpenAI from "openai";
const openai = new OpenAI(); // Uses OPENAI_API_KEY env var
export async function embedTexts(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
// Sort by index to maintain order
return response.data
.sort((a, b) => a.index - b.index)
.map((item) => item.embedding);
}
export async function embedQuery(query: string): Promise<number[]> {
const [embedding] = await embedTexts([query]);
return embedding;
}
We separate embedTexts (batch) from embedQuery (single) for clarity. The batch function accepts an array of texts and returns an array of vectors in the same order — this is important because OpenAI processes them more efficiently in a single API call than in multiple individual calls.
In production, you'd want to handle larger document sets by batching. OpenAI's API accepts up to 2048 texts per call, so for a 10,000-chunk corpus you'd split into 5 batches:
async function embedBatch(texts: string[], batchSize = 2048): Promise<number[][]> {
const allEmbeddings: number[][] = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const embeddings = await embedTexts(batch);
allEmbeddings.push(...embeddings);
}
return allEmbeddings;
}
You'd also want retry logic for rate limits — OpenAI returns 429 errors when you exceed your tokens-per-minute quota. A simple exponential backoff handles this gracefully.
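A minimal sketch of such a backoff wrapper — `withRetry` is a hypothetical helper, not part of the OpenAI SDK, and a production version would inspect the error to retry only on 429s:

```typescript
// Hypothetical retry helper: wait 1s, 2s, 4s, ... between failed attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts - 1) throw err; // out of attempts: give up
      const delay = baseDelayMs * 2 ** attempt;  // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage (hypothetical): const vectors = await withRetry(() => embedTexts(batch));
```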
The sort-by-index step in embedTexts matters because OpenAI's API doesn't guarantee response order matches input order. Without it, your embeddings could get shuffled relative to your chunks — and you'd silently store the wrong vector for each chunk. This is the kind of bug that's extremely hard to debug because everything appears to work, just with slightly worse retrieval quality.
Step 3: Vector store (in-memory)
A vector store is just a collection of vectors with a way to find the nearest neighbors to a query vector. We'll start with the simplest possible implementation: an in-memory store using cosine similarity.
Cosine similarity measures the angle between two vectors. A value of 1.0 means identical direction (identical meaning); 0.0 means the vectors are orthogonal (unrelated). It ignores vector magnitude, so it works regardless of whether your embeddings are normalized. This is important because different embedding models produce vectors with different magnitudes — cosine similarity gives you a consistent comparison metric.
Why cosine similarity over other distance metrics? There are three common options:
- Cosine similarity — Measures the angle between vectors. Range: -1 to 1, where 1 = identical direction. Ignores magnitude.
- Euclidean distance (L2) — Measures the straight-line distance between two points. Smaller = more similar. Sensitive to magnitude.
- Dot product — Measures both direction and magnitude. Faster to compute but results depend on vector norms.
For normalized embeddings (which OpenAI's are), all three give the same ranking. But cosine similarity is the standard for text embeddings because it's invariant to vector length, making it more robust across different embedding models and document lengths.
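You can verify the ranking equivalence yourself with toy 2-D unit vectors (illustrative only — real embeddings have hundreds of dimensions):

```typescript
// For unit-length vectors, cosine, dot product, and Euclidean distance
// all rank neighbors identically. Toy 2-D vectors for illustration.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a: number[], b: number[]): number =>
  dot(a, b) / (Math.hypot(...a) * Math.hypot(...b));
const euclidean = (a: number[], b: number[]): number =>
  Math.hypot(...a.map((v, i) => v - b[i]));

const query = [1, 0];                      // unit length
const docA = [Math.SQRT1_2, Math.SQRT1_2]; // 45 degrees from query
const docB = [0, 1];                       // 90 degrees from query

// docA ranks above docB under all three metrics:
// higher cosine, higher dot product, smaller Euclidean distance
```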
import { Chunk } from "./chunker.js";
export interface StoredDocument {
chunk: Chunk;
embedding: number[];
source: string;
}
export interface SearchResult {
chunk: Chunk;
score: number;
source: string;
}
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
export class VectorStore {
private documents: StoredDocument[] = [];
add(chunks: Chunk[], embeddings: number[][], source: string): void {
for (let i = 0; i < chunks.length; i++) {
this.documents.push({
chunk: chunks[i],
embedding: embeddings[i],
source,
});
}
}
search(queryEmbedding: number[], topK: number = 3): SearchResult[] {
const scored = this.documents.map((doc) => ({
chunk: doc.chunk,
source: doc.source,
score: cosineSimilarity(queryEmbedding, doc.embedding),
}));
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, topK);
}
get size(): number {
return this.documents.length;
}
}
This brute-force approach checks every document on every query. It's O(n) per search, which is fine for hundreds or even thousands of documents. Once you hit tens of thousands, you'll want an approximate nearest neighbor (ANN) index — which is exactly what production vector databases provide.
ANN algorithms like HNSW (Hierarchical Navigable Small World) build a graph structure over your vectors during indexing. At query time, they navigate this graph to find approximate nearest neighbors in O(log n) instead of comparing against every vector. The tradeoff is a small accuracy loss (typically 95-99% recall) for dramatically faster search — milliseconds instead of seconds at scale.
For our tutorial's three documents with six chunks, brute-force is instantaneous. But if you're indexing 100,000 support articles, each search would compare against every one of those vectors. At that scale, a dedicated vector database with HNSW indexing returns results in under 50ms where brute-force would take seconds.
Step 4: Generation
Now the fun part. We take the retrieved chunks, build a prompt, and ask the LLM to answer using only the provided context. This is where the "augmented" in Retrieval-Augmented Generation happens — the model's generation is augmented with retrieved information.
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
export interface GenerationResult {
answer: string;
sources: SearchResult[];
prompt: string;
}
export async function generate(
query: string,
results: SearchResult[],
options: { model?: string; temperature?: number } = {}
): Promise<GenerationResult> {
const { model = "gpt-4o-mini", temperature = 0.2 } = options;
const contextBlock = results
.map(
(r, i) =>
`[Source ${i + 1}] (score: ${r.score.toFixed(3)}, from: ${r.source})\n${r.chunk.text}`
)
.join("\n\n");
const systemPrompt = `You are a helpful assistant that answers questions based on the provided context documents.
Rules:
- Answer ONLY based on the provided context
- If the context doesn't contain enough information, say so
- Cite which source(s) you used with [Source N] notation
- Be concise and direct`;
const userPrompt = `Context documents:
${contextBlock}
Question: ${query}
Answer based on the context above:`;
const response = await openai.chat.completions.create({
model,
temperature,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: userPrompt },
],
});
return {
answer: response.choices[0].message.content ?? "",
sources: results,
prompt: userPrompt,
};
}
A few design choices worth noting here:
The system prompt explicitly tells the model to only use the provided context. This is crucial — without it, the model will happily fill gaps with its training data, which defeats the purpose of RAG. This is a core prompt engineering principle: explicit constraints produce more reliable behavior.
We include the similarity score in each source block. This is useful for debugging — if your top chunk has a score of 0.65, that's a signal your retrieval might not be finding great matches, even before you look at the generated answer.
Temperature is set to 0.2, which keeps the model's output focused and consistent. Higher temperatures (0.7+) produce more creative responses, but for factual RAG answers you want repeatability. If you ask the same question twice, you should get substantially the same answer.
We return the full prompt alongside the answer. This makes debugging much easier — you can see exactly what the model received and why it responded the way it did. In production, logging the prompt, the retrieved chunks, the similarity scores, and the generated answer gives you a complete audit trail for every response.
The prompt structure itself matters more than you might think. We put the context before the question, which works well for most models. Some teams find that putting the question first ("Given this question: X, use the following context to answer:") works better for their use case. The [Source N] citation format makes it easy for users to verify answers against the original documents — transparency that builds trust in RAG-powered systems.
The top-K parameter (how many chunks to retrieve) creates a direct tradeoff between context coverage and prompt cost. More chunks means the model has more information to work with, but also more tokens to pay for and more potential for the model to get confused by irrelevant context. A good starting point:
- K=3 for focused, single-topic questions ("What's the refund policy?")
- K=5 for broader questions that might span multiple documents ("Give me an overview of the pricing tiers and what each includes")
- K=1 for simple lookup questions where you just need the closest match ("What's the phone number for support?")
Step 5: Putting it all together
Now we wire up all four components — chunker, embedder, vector store, and generator — into a complete pipeline. The indexing phase runs once (or whenever documents change), while the query phase runs for every user question.
The sample documents below simulate a real knowledge base with three types of content: product overview (features), pricing FAQ (numbers and plans), and technical documentation (the memory system). This variety lets us test whether retrieval correctly routes different types of questions to the right source documents.
import { chunkText } from "./chunker.js";
import { embedTexts, embedQuery } from "./embeddings.js";
import { VectorStore } from "./vector-store.js";
import { generate } from "./generator.js";
// Sample documents — imagine these come from your knowledge base
const documents = [
{
source: "product-overview.md",
content: `Chanl is an AI agent platform for building, connecting, and monitoring
customer experience agents. It supports voice and text channels. Agents can be
configured with custom prompts, knowledge bases, and tool integrations.
The platform provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement.
Agents connect to external systems through MCP (Model Context Protocol)
integrations. MCP allows agents to call APIs, query databases, and trigger
workflows in third-party tools without custom code.`,
},
{
source: "pricing-faq.md",
content: `Chanl offers three pricing tiers: Lite, Startup, and Business.
The Lite plan includes up to 5 agents and 1,000 interactions per month.
It costs $49/month and is designed for small teams getting started.
The Startup plan includes up to 25 agents and 10,000 interactions per month.
It costs $199/month and includes advanced analytics and priority support.
The Business plan includes unlimited agents and interactions.
Pricing is custom and includes dedicated support, SLAs, and SSO.`,
},
{
source: "memory-system.md",
content: `The memory system allows agents to remember information across conversations.
Short-term memory persists within a single conversation session.
Long-term memory stores facts about customers across multiple conversations.
Memory entries are automatically extracted from conversations and stored
as key-value pairs. For example, if a customer mentions they prefer email
communication, the agent stores this preference and uses it in future
interactions.
Memory can be managed through the API or the admin dashboard. Entries can
be viewed, edited, or deleted. Memory is scoped per customer per agent.`,
},
];
async function main() {
console.log("=== RAG Pipeline Demo ===\n");
// Step 1: Index documents
console.log("Indexing documents...");
const store = new VectorStore();
for (const doc of documents) {
const chunks = chunkText(doc.content, { maxChunkSize: 300, overlap: 30 });
const embeddings = await embedTexts(chunks.map((c) => c.text));
store.add(chunks, embeddings, doc.source);
console.log(` ${doc.source}: ${chunks.length} chunks`);
}
console.log(`\nTotal chunks in store: ${store.size}\n`);
// Step 2: Query
const queries = [
"What analytics features does Chanl provide?",
"How much does the Startup plan cost?",
"How does the memory system work?",
"Does Chanl support Salesforce integration?",
];
for (const query of queries) {
console.log(`Q: ${query}`);
// Retrieve
const queryEmbedding = await embedQuery(query);
const results = store.search(queryEmbedding, 3);
console.log(` Retrieved ${results.length} chunks:`);
for (const r of results) {
console.log(
` - [${r.source}] score: ${r.score.toFixed(3)} | "${r.chunk.text.slice(0, 60)}..."`
);
}
// Generate
const { answer } = await generate(query, results);
console.log(`\nA: ${answer}\n`);
console.log("---\n");
}
}
main().catch(console.error);
Run it:
export OPENAI_API_KEY="sk-your-key-here"
npx tsx src/rag.ts
You should see the pipeline index your documents, retrieve relevant chunks for each query, and generate grounded answers. Here's what the output looks like:
=== RAG Pipeline Demo ===
Indexing documents...
product-overview.md: 2 chunks
pricing-faq.md: 2 chunks
memory-system.md: 2 chunks
Total chunks in store: 6
Q: What analytics features does Chanl provide?
Retrieved 3 chunks:
- [product-overview.md] score: 0.847 | "The platform provides real-time analytics for monitoring..."
- [product-overview.md] score: 0.762 | "Chanl is an AI agent platform for building, connecting..."
- [pricing-faq.md] score: 0.643 | "The Startup plan includes up to 25 agents and 10,000..."
A: Chanl provides real-time analytics for monitoring agent performance,
including call duration, resolution rates, and customer satisfaction scores.
Analytics dashboards show trends over time and highlight areas for improvement
[Source 1].
Pay attention to the similarity scores — they tell you how confident the retrieval is for each chunk. In this example, the top chunk scores 0.847 (strong match), the second is 0.762 (good supporting context), and the third at 0.643 is a weaker match that was pulled in because it mentions analytics tangentially.
Notice the four queries test different aspects of the pipeline. The first three have clear answers in the documents. The last query — about Salesforce — is intentionally unanswerable from the provided context. A well-configured RAG pipeline should say it doesn't have enough information rather than hallucinate. If your pipeline makes up a Salesforce answer, your system prompt needs tightening.
This is a good sanity check for any RAG system: always include at least one question that can't be answered from the context. If the model answers it anyway, you've got a faithfulness problem.
Choosing a chunking strategy
The chunking strategy you choose has a bigger impact on retrieval quality than most people expect. Recursive chunking is your best default — it tries paragraphs first, falls back to sentences, then words, preserving semantic coherence while respecting size limits.
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size (every N chars) | Simple, predictable | Cuts mid-sentence, breaks meaning | Structured data, logs |
| Sentence-based (split on .) | Preserves sentence meaning | Uneven sizes, some chunks too small | Clean prose, FAQs |
| Recursive (paragraph → sentence → word) | Best semantic coherence | More complex to implement | General-purpose (recommended) |
| Semantic (split when meaning shifts) | Most precise boundaries | Requires embedding each sentence first | High-quality knowledge bases |
The recursive splitter we built follows exactly this pattern, and its overlap parameter ensures that context at chunk boundaries isn't completely lost.
Chunk size directly affects retrieval precision. Smaller chunks (200–300 tokens) are more precise but miss surrounding context. Larger chunks (500–1000 tokens) capture more context but dilute the signal — the embedding becomes an average of too many ideas. 300–500 tokens hits the sweet spot for most pipelines.
There's a fourth strategy worth mentioning: semantic chunking. Instead of splitting on character boundaries, you embed every sentence, then split where the cosine similarity between consecutive sentences drops below a threshold. This produces chunks that follow the natural topic boundaries of the text. The downside is that you have to embed every sentence during indexing (more API calls), so it's more expensive. But for high-value knowledge bases where retrieval quality is critical, it can meaningfully improve results.
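A sketch of the semantic-chunking idea, with the embedding function left pluggable so you could wire in embedTexts. The `semanticChunk` name, threshold, and toy embedder are illustrative assumptions, not a production implementation:

```typescript
// Semantic chunking sketch: embed each sentence, start a new chunk whenever
// similarity between consecutive sentences drops below a threshold.
function semanticChunk(
  sentences: string[],
  embed: (s: string) => number[], // pluggable; in practice an API call
  threshold = 0.5
): string[][] {
  const cosine = (a: number[], b: number[]): number => {
    const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
    return dot / (Math.hypot(...a) * Math.hypot(...b));
  };
  const chunks: string[][] = [];
  let current: string[] = [];
  let prev: number[] | null = null;
  for (const sentence of sentences) {
    const vec = embed(sentence);
    if (prev && cosine(prev, vec) < threshold) {
      chunks.push(current); // topic shift: close the current chunk
      current = [];
    }
    current.push(sentence);
    prev = vec;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

With a real embedder, sentences about pricing would cluster into one chunk and sentences about, say, the memory system into another, because the cross-topic similarity dips below the threshold.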
To illustrate why strategy matters, consider this document:
Our enterprise plan includes dedicated support with a 4-hour SLA. All enterprise customers get SSO integration. Pricing is based on usage volume and starts at $500/month.
With fixed-size chunking at 80 characters, this might split into:
- Chunk 1: "Our enterprise plan includes dedicated support with a 4-hour SLA. All ente"
- Chunk 2: "rprise customers get SSO integration. Pricing is based on usage volume and"
- Chunk 3: " starts at $500/month."
A query about "enterprise pricing" would match Chunk 3 (which mentions $500 but lacks context) and possibly Chunk 2 (which mentions pricing but cuts off). The recursive splitter keeps this as a single chunk because it's under the size limit, so all three facts stay together.
A practical way to validate your chunking: run 20 representative queries and manually check whether the right chunk lands in the top 3 results. If relevant information keeps getting split across chunks or buried in oversized ones, adjust your size and overlap parameters.
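That manual check can be sketched as a tiny recall@K harness — `recallAtK`, `EvalCase`, and the ID-returning `search` signature are hypothetical stand-ins for however your pipeline exposes retrieval:

```typescript
// Hypothetical eval sketch: fraction of test queries whose expected chunk
// appears in the top K retrieval results.
interface EvalCase {
  query: string;
  expectedChunkId: string;
}

function recallAtK(
  cases: EvalCase[],
  search: (query: string, k: number) => string[], // returns chunk IDs
  k = 3
): number {
  let hits = 0;
  for (const c of cases) {
    if (search(c.query, k).includes(c.expectedChunkId)) hits++;
  }
  return hits / cases.length;
}
```

A score below ~0.8 on representative queries is a strong hint to revisit chunk size, overlap, or the embedding model before touching the generation prompt.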
Choosing an embedding model
We used OpenAI's text-embedding-3-small because it's the easiest to get started with. Here's how it compares to the alternatives:
| Model | Dimensions | Cost | Quality | Speed |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Good | Fast |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Better | Fast |
| Voyage AI voyage-3 | 1024 | $0.06/1M tokens | Excellent for code | Fast |
| Nomic Embed (local) | 768 | Free (self-hosted) | Good | Depends on hardware |
| BGE-M3 (local) | 1024 | Free (self-hosted) | Good multilingual | Depends on hardware |
For most teams, text-embedding-3-small is the right default — fast, cheap, and good enough. If you're already using OpenAI, it keeps things simple. If you need the best possible retrieval quality and don't mind the cost, text-embedding-3-large is a meaningful step up — the extra 1536 dimensions capture finer-grained semantic distinctions. If you can't send data to external APIs, Nomic or BGE-M3 run locally via Ollama.
Voyage AI deserves special mention if your documents contain code. Voyage's embedding models are trained with an emphasis on code and technical text (including code-specific variants), so they tend to outperform general-purpose models for code search and technical documentation retrieval.
One critical rule: you must use the same embedding model for indexing and querying. Vectors from different models live in different vector spaces and can't be compared meaningfully. If you switch embedding models, you need to re-embed your entire document set. This is also why it's worth choosing carefully upfront — re-embedding a million documents isn't free.
Cost estimation
Here's a quick way to estimate your embedding costs. A typical English word is about 1.3 tokens, so a 500-word chunk is ~650 tokens; the table below rounds to 500-token chunks to keep the arithmetic simple.
| Corpus size | Chunks (at 500 tokens each) | Embedding cost (3-small) | Embedding cost (3-large) |
|---|---|---|---|
| 100 pages | ~200 chunks | $0.002 | $0.013 |
| 1,000 pages | ~2,000 chunks | $0.02 | $0.13 |
| 10,000 pages | ~20,000 chunks | $0.20 | $1.30 |
| 100,000 pages | ~200,000 chunks | $2.00 | $13.00 |
You only pay for embedding once per document. Re-embedding happens only when documents change. Query embedding costs are negligible — one query is a single API call of ~20 tokens.
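The arithmetic can be wrapped in a small helper for plugging in your own corpus numbers. This is a sketch using the list prices from the comparison table above; actual token counts depend on your chunking.

```typescript
// Rough one-time embedding cost: chunks × tokens-per-chunk × price.
// Prices are $ per 1M tokens, from the model comparison table above.
const PRICE_PER_M_TOKENS: Record<string, number> = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
};

function estimateEmbeddingCost(
  chunkCount: number,
  model: string,
  tokensPerChunk = 500
): number {
  const totalTokens = chunkCount * tokensPerChunk;
  return (totalTokens / 1_000_000) * PRICE_PER_M_TOKENS[model];
}

// ~2,000 chunks (roughly 1,000 pages):
console.log(estimateEmbeddingCost(2_000, "text-embedding-3-small")); // → 0.02
```

Re-running this with your real chunk count after indexing tells you what a full re-embed would cost if you ever switch models.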
Dimension reduction
OpenAI's text-embedding-3-* models support a dimensions parameter that lets you reduce the output size. You can request 256 or 512 dimensions instead of the full 1536. Smaller vectors mean faster search and less storage, at the cost of some retrieval quality. For prototyping or very large corpora where storage cost matters, this is a useful lever.
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
dimensions: 512, // Reduced from 1536
});
At 512 dimensions, your vectors use 3x less memory and searches run faster, while retrieval quality drops only 1-3% for most use cases. This is a good option when you're indexing millions of documents and storage cost is a real concern. You can test by running your eval suite at 1536 vs 512 dimensions and measuring the actual quality difference for your specific data.
Choosing a vector database
The in-memory vector store we built works for demos and small datasets. For production, you'll want a dedicated vector database that handles persistence, scaling, metadata filtering, and efficient approximate nearest-neighbor search. Here's the landscape:
Pinecone — Fully managed, serverless, sub-50ms latency even at billion-scale. Best for teams that don't want to manage infrastructure. Free tier available.
Chroma — Open source, Python-native, minimal setup. Great for prototyping and small to medium datasets. Can run embedded in your process or as a separate server.
pgvector — PostgreSQL extension. If you already run Postgres, this is the lowest-friction option, with competitive performance up to ~100M vectors when paired with the pgvectorscale extension. The biggest advantage: your chunks, embeddings, and document metadata all live in the same Postgres instance, so you can JOIN vector results against relational data with standard SQL instead of maintaining a second database.
Example pgvector query for context:
SELECT chunk_text, source, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE category = 'pricing'
ORDER BY embedding <=> $1
LIMIT 3;
The <=> operator computes cosine distance. The metadata filter (WHERE category = 'pricing') scopes the vector search to matching rows, which is exactly the kind of scoping that improves retrieval precision.
Weaviate — Open source with strong hybrid search (combining vector + keyword). Available as managed cloud or self-hosted.
Qdrant — Open source, Rust-based, excellent performance. Best free tier among dedicated vector databases.
Here's a quick decision matrix:
| If you... | Use |
|---|---|
| Already run PostgreSQL | pgvector — no new infrastructure |
| Want zero ops | Pinecone — fully managed, serverless |
| Need open-source + self-hosted | Qdrant — best performance, Rust-based |
| Are prototyping in Python | Chroma — fastest to get started |
| Need hybrid search built in | Weaviate — vector + keyword out of the box |
Swapping from our in-memory store to any of these is straightforward — you're replacing the add() and search() methods. The chunking, embedding, and generation layers stay exactly the same. That modularity is why building from scratch first is valuable: you understand which piece does what, so upgrading one component doesn't require rethinking the whole system.
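Concretely, the swap surface looks something like this. The interface below is a sketch (the names are illustrative, not the exact store from earlier in the tutorial), with a brute-force in-memory implementation as one possible backend:

```typescript
interface Chunk {
  id: string;
  text: string;
}

interface SearchResult {
  chunk: Chunk;
  score: number;
}

// Any backend (in-memory, pgvector, Pinecone, Qdrant) implements this.
interface VectorStore {
  add(chunks: Chunk[], embeddings: number[][]): void;
  search(queryEmbedding: number[], topK: number): SearchResult[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force scan: fine for demos, replaced by an ANN index
// in any of the databases above.
class InMemoryStore implements VectorStore {
  private items: { chunk: Chunk; embedding: number[] }[] = [];

  add(chunks: Chunk[], embeddings: number[][]): void {
    chunks.forEach((chunk, i) => {
      this.items.push({ chunk, embedding: embeddings[i] });
    });
  }

  search(queryEmbedding: number[], topK: number): SearchResult[] {
    return this.items
      .map(({ chunk, embedding }) => ({
        chunk,
        score: cosineSimilarity(queryEmbedding, embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

A pgvector or Pinecone adapter implements the same two methods; nothing upstream (chunking, embedding) or downstream (generation) changes.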
When your agents start connecting to external systems via MCP integrations and tool calls, the RAG pipeline becomes just one of several information sources. The vector store might handle product docs while an MCP tool queries live inventory data. Understanding each piece independently makes that composition straightforward.
Evaluating your pipeline
A RAG pipeline that returns wrong answers confidently is worse than no RAG at all. You need to measure three things.
Retrieval quality
Did the retriever find the right chunks? The simplest check: look at the similarity scores and the retrieved text. If the top chunk isn't relevant to the question, your retrieval is broken — no amount of generation quality fixes that.
With cosine similarity, a score above 0.8 usually means strong relevance, 0.6–0.8 is acceptable but worth monitoring, and below 0.6 the retriever is probably pulling in noise (absolute thresholds vary by embedding model, so calibrate against your own data). Log these scores for every query in production so you can spot degradation over time.
You should also check for false negatives — queries where the correct chunk exists in your store but doesn't appear in the top-K results. This usually means the chunk's embedding doesn't capture the right semantics, or your chunk is too large and the relevant passage is buried in unrelated text.
A simple retrieval evaluation metric is Recall@K: for a set of test queries where you know which chunk should be retrieved, what percentage of the time does the correct chunk appear in the top K results? Aim for Recall@3 above 85%. If you're below that, focus on chunking and embedding quality before touching anything else.
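Recall@K takes only a few lines once you have a labeled test set. A sketch, assuming each test case pairs a query's expected chunk id with the ranked ids your retriever actually returned:

```typescript
interface RecallCase {
  expectedChunkId: string;     // the chunk that should be retrieved
  retrievedChunkIds: string[]; // ids your retriever returned, best first
}

// Fraction of test queries whose expected chunk appears in the top K.
function recallAtK(cases: RecallCase[], k: number): number {
  const hits = cases.filter((c) =>
    c.retrievedChunkIds.slice(0, k).includes(c.expectedChunkId)
  ).length;
  return hits / cases.length;
}
```

Run it after every chunking or embedding change; a drop below your baseline tells you where to look before touching the prompt.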
Faithfulness
Does the generated answer actually use the retrieved context, or is the model ignoring it and hallucinating? This is the most critical evaluation dimension. A model that makes up plausible-sounding information is actively harmful — users trust it because it sounds confident.
Test this explicitly: retrieve chunks about Topic A, but ask about Topic B. If the model answers about Topic B (using training data instead of admitting the context doesn't cover it), your faithfulness constraint is too weak.
Answer quality
Is the answer correct, complete, and helpful? This is the end-to-end metric. Even with good retrieval and faithful generation, the answer might be poorly structured, miss important nuances, or be unnecessarily verbose.
Here's an evaluation function using LLM-as-judge:
import OpenAI from "openai";
import { SearchResult } from "./vector-store.js";
const openai = new OpenAI();
interface EvalResult {
relevanceScore: number;
faithfulnessScore: number;
qualityScore: number;
reasoning: string;
}
export async function evaluateResponse(
query: string,
answer: string,
retrievedChunks: SearchResult[],
referenceAnswer?: string
): Promise<EvalResult> {
const context = retrievedChunks.map((r) => r.chunk.text).join("\n\n");
const evalPrompt = `You are an evaluation judge for a RAG system. Score the following on a scale of 1-5.
Query: ${query}
Retrieved Context:
${context}
Generated Answer:
${answer}
${referenceAnswer ? `\nReference Answer: ${referenceAnswer}` : ""}
Score these three dimensions (1-5 each):
1. RELEVANCE: Are the retrieved chunks relevant to the query?
2. FAITHFULNESS: Does the answer only use information from the retrieved context? (5 = fully grounded, 1 = hallucinated)
3. QUALITY: Is the answer correct, complete, and helpful?
Respond in JSON format:
{"relevance": N, "faithfulness": N, "quality": N, "reasoning": "brief explanation"}`;
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0,
messages: [{ role: "user", content: evalPrompt }],
response_format: { type: "json_object" },
});
const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
return {
relevanceScore: parsed.relevance ?? 0,
faithfulnessScore: parsed.faithfulness ?? 0,
qualityScore: parsed.quality ?? 0,
reasoning: parsed.reasoning ?? "",
};
}
// Usage: add this to your main() function
// const evalResult = await evaluateResponse(query, answer, results);
// console.log(` Eval: R=${evalResult.relevanceScore} F=${evalResult.faithfulnessScore} Q=${evalResult.qualityScore}`);
// console.log(`  Reasoning: ${evalResult.reasoning}`);
This uses LLM-as-judge evaluation — the same approach used by RAG evaluation frameworks like RAGAS and DeepEval. The judge can be wrong (it's an LLM too), but it's the fastest way to get automated quality signals at scale.
A few notes on making evaluation useful in practice:
- Build a test set. Curate 50–100 question-answer pairs that cover your key use cases. Run them through the pipeline after every change to catch regressions.
- Track scores over time. A sudden drop in average faithfulness score after you change your chunking strategy is a clear signal something went wrong.
- Use reference answers. When you have gold-standard answers, pass them as referenceAnswer to give the judge something to compare against.
- Don't trust the judge blindly. Spot-check its evaluations manually. If the judge consistently rates faithfulness 5/5 when you can see the model is hallucinating, your eval prompt needs work.
For a production evaluation harness with regression testing and CI integration, see our guide on building an eval framework from scratch. In production, combine automated LLM-as-judge scoring with human evaluation on a sample of queries and analytics dashboards tracking quality metrics over time. The automated scores catch regressions fast; the human reviews catch the subtle failures that automated scoring misses.
Common failure modes
Once you have a working pipeline, here's what typically breaks — and how to fix it.
Retriever finds irrelevant chunks. Your chunks are too large, or your embeddings don't capture the right semantics. Fix: smaller chunks, try a better embedding model, or add metadata filtering to scope searches by document category. If a user asks about billing but your retriever pulls in onboarding docs, metadata filters that restrict search to "billing" documents would solve this immediately.
Model ignores the context. This usually means your prompt isn't constraining the model enough, or the context is so long that the model "loses" the relevant information in the middle. Research on "lost in the middle" has shown that LLMs attend more to the beginning and end of their context window. Fix: tighten your system prompt, reduce the number of retrieved chunks, or put the most relevant chunk last.
Answers are correct but miss important details. Your top-K is too low, or the relevant information is spread across chunks that don't get retrieved together. Fix: increase top-K from 3 to 5, add a reranker to promote better chunks from a larger initial retrieval set (retrieve 20, rerank, keep 3), or try larger chunks with more overlap so related information stays together.
Performance is slow. Embedding the query + searching + generating takes too long for interactive use. Fix: cache frequently-asked-question embeddings so you skip the embedding step for repeated queries, use a vector database with ANN indexing instead of brute-force search, and consider a smaller or faster generation model for lower-stakes queries.
Answers hallucinate beyond the context. The model fills in gaps with training data even when told not to. Fix: lower the temperature to 0.1, make the system prompt more explicit about refusing to answer when context is insufficient, and add a faithfulness evaluation step that flags or blocks low-scoring responses before they reach users.
Stale documents produce wrong answers. Your knowledge base has been updated but the embeddings haven't been regenerated. Fix: build a re-indexing pipeline that watches for document changes and re-embeds affected chunks. Track the last-indexed timestamp per document so you can verify freshness. This is especially dangerous for time-sensitive content like pricing, policies, or compliance documents where an outdated answer could have real consequences.
Duplicate or near-duplicate chunks dominate results. If the same information appears in multiple documents (e.g., the refund policy is mentioned in both the FAQ and the terms of service), your top-K results might all contain the same information, pushing out other relevant context. Fix: deduplicate at indexing time by checking cosine similarity between new chunks and existing ones, or add diversity to your retrieval by penalizing chunks that are too similar to already-selected results (maximal marginal relevance).
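Maximal marginal relevance is a greedy loop on top of cosine similarity. A sketch, assuming each candidate arrives with its embedding and its similarity score against the query; lambda trades relevance against diversity:

```typescript
interface Candidate {
  id: string;
  embedding: number[];
  queryScore: number; // similarity to the query, from the initial search
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedily pick chunks that are relevant to the query but dissimilar to
// chunks already picked. lambda = 1 is pure relevance, 0 is pure diversity.
function mmr(candidates: Candidate[], k: number, lambda = 0.7): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    pool.forEach((c, i) => {
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => cosine(c.embedding, s.embedding)));
      const score = lambda * c.queryScore - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

A lambda around 0.5–0.7 is a common starting point; tune it against your eval suite rather than guessing.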
Production considerations
Moving from this tutorial to a production RAG system involves a few additional concerns that are worth thinking about early, even if you don't implement them right away.
Hybrid search. Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid search combines both — run a BM25 keyword search and a vector search in parallel, then merge the results. Most production RAG systems use this approach. Weaviate and pgvector both support it natively.
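One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only ranks, so there's no need to normalize BM25 scores against cosine similarities. A sketch:

```typescript
// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank).
// k = 60 is the conventional constant from the original RRF paper; it
// keeps a single top rank from dominating the fused score.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Chunk ids ranked by BM25 and by vector similarity:
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // keyword results
  ["b", "c", "d"], // vector results
]);
// "b" ranks first because it appears high in both lists.
```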
Reranking. Vector search is fast but approximate. A cross-encoder reranker takes the top 20 results from vector search and re-scores them using a more accurate (but slower) model. The reranker sees both the query and the document together (not just their independent embeddings), so it can catch nuances that embedding comparison misses — like negation, qualifier words, or subtle distinctions. Cohere's Rerank API and open-source models like bge-reranker-v2-m3 are popular choices. The typical latency overhead is 100-300ms, which is worth it when precision matters.
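The retrieve-then-rerank pattern is a thin wrapper regardless of which reranker you plug in. In this sketch the cross-encoder is abstracted as a scoring callback; in practice that callback would call Cohere's Rerank API or a local bge-reranker model (the stub in the test is purely illustrative):

```typescript
interface RerankedResult {
  text: string;
  score: number;
}

// Retrieve wide with vector search, rescore each (query, document) pair
// with a more accurate cross-encoder, keep a narrow top-N.
async function retrieveAndRerank(
  query: string,
  candidates: string[], // e.g. top-20 chunks from the vector store
  crossEncoderScore: (query: string, doc: string) => Promise<number>,
  topN = 3
): Promise<RerankedResult[]> {
  const scored = await Promise.all(
    candidates.map(async (text) => ({
      text,
      score: await crossEncoderScore(query, text),
    }))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```

Because the scorer is injected, you can swap rerankers, or stub one out in tests, without touching the retrieval pipeline.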
Query routing. Not every question needs RAG. "What's 2+2?" doesn't require document retrieval. A lightweight classifier can route simple questions directly to the LLM and only invoke the RAG pipeline for questions that need domain knowledge. This reduces latency and cost.
Multi-turn context. In a conversation, the user's second question often references the first. "Tell me about pricing" followed by "What about the enterprise tier?" — the second query alone doesn't mention pricing. If you embed it as-is, the retriever won't know to look for pricing documents. Fix: use the LLM to rewrite the query into a standalone form ("What is the pricing for the enterprise tier?") before embedding it. This is called "query rewriting" or "question condensation" and adds one LLM call per turn but dramatically improves multi-turn retrieval accuracy.
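A minimal sketch of the rewriting step: build a prompt from the conversation so far, send it to a cheap model, and embed the rewritten question instead of the raw one. The prompt wording here is an assumption to adapt, not a canonical template:

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Ask the LLM to fold conversation context into a standalone question
// before embedding it for retrieval.
function buildRewritePrompt(history: Turn[], latestMessage: string): string {
  const transcript = history
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  return [
    "Rewrite the final user message as a standalone question that",
    "contains all the context needed to search a document store.",
    "Return only the rewritten question.",
    "",
    "Conversation:",
    transcript,
    `user: ${latestMessage}`,
  ].join("\n");
}
```

Send this prompt to a cheap model (gpt-4o-mini at temperature 0 is a reasonable choice) and feed its output to your embedding call in place of the raw follow-up.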
Access control. If different users should see different documents, you need to filter results by permission at query time. Tag each chunk with the access groups that should see it, then add a filter to your vector search. This is straightforward with metadata filtering in Pinecone or pgvector, but easy to forget until someone retrieves a document they shouldn't have access to.
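The post-filter version is a one-liner, shown here as a sketch; in production you'd push the same condition into Pinecone or pgvector as a metadata filter so unauthorized chunks never leave the database:

```typescript
interface SecuredResult {
  text: string;
  score: number;
  accessGroups: string[]; // groups allowed to see this chunk, set at indexing time
}

// Keep only results the requesting user's groups are allowed to see.
function filterByAccess(
  results: SecuredResult[],
  userGroups: string[]
): SecuredResult[] {
  return results.filter((r) =>
    r.accessGroups.some((group) => userGroups.includes(group))
  );
}
```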
Document versioning. Documents change over time. Your 2024 return policy might differ from your 2026 one. If you just overwrite chunks, old queries might get stale information. A common pattern: store a version or timestamp with each chunk, and during re-indexing, insert new chunks before deleting old ones so there's no gap in coverage.
Error handling and fallbacks. In production, the embedding API might be slow or unavailable. Your vector store might time out. The generation model might return an error. Build graceful degradation: if embeddings fail, fall back to keyword search. If retrieval returns low-confidence results (all scores below 0.5), skip generation and return "I couldn't find relevant information." If generation fails, return the raw chunks with a note that the answer couldn't be synthesized.
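The degradation chain can be written as one wrapper with injected dependencies, which also makes it easy to unit-test with stubs. A sketch (the function names and the 0.5 threshold are illustrative):

```typescript
interface RetrievedChunk {
  text: string;
  score: number;
}

// Degrade gracefully: keyword search if embeddings fail, an honest
// refusal if confidence is low, raw chunks if generation fails.
async function answerWithFallbacks(
  query: string,
  vectorSearch: (q: string) => Promise<RetrievedChunk[]>,
  keywordSearch: (q: string) => Promise<RetrievedChunk[]>,
  generate: (q: string, chunks: RetrievedChunk[]) => Promise<string>,
  minScore = 0.5
): Promise<string> {
  let chunks: RetrievedChunk[];
  try {
    chunks = await vectorSearch(query);
  } catch {
    chunks = await keywordSearch(query); // embedding API down
  }
  if (chunks.every((c) => c.score < minScore)) {
    return "I couldn't find relevant information.";
  }
  try {
    return await generate(query, chunks);
  } catch {
    // Better than nothing: show the sources without a synthesized answer.
    return (
      "Couldn't synthesize an answer. Relevant excerpts:\n" +
      chunks.map((c) => c.text).join("\n")
    );
  }
}
```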
Monitoring and observability. Log every query along with: the retrieved chunks, their similarity scores, the generated answer, and the latency of each step (embedding, retrieval, generation). This gives you a complete picture of pipeline health. Alert on: average retrieval score dropping below a threshold, generation latency spiking, or the proportion of "I don't know" answers increasing. Over time, these logs also become your eval dataset — real user queries with real retrieved results that you can use to improve chunking, embeddings, and prompts.
Caching. Two layers of caching can dramatically reduce cost and latency. First, cache embedding results for repeated queries — if 100 users ask "What's your refund policy?", you only need to embed it once. Second, cache full RAG responses for identical queries. The cache key is the query text (or its hash), and you invalidate it when the underlying documents change.
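A sketch of both layers with a version counter for invalidation: bumping the version on re-index makes every old key unreachable, which is cheaper than enumerating and deleting stale entries. In production you'd back this with Redis or similar rather than in-process Maps.

```typescript
// Two cache layers keyed by normalized query text. Bump `version`
// whenever the underlying documents change to invalidate everything.
class RagCache {
  private embeddings = new Map<string, number[]>();
  private answers = new Map<string, string>();
  private version = 0;

  private key(query: string): string {
    return `${this.version}:${query.trim().toLowerCase()}`;
  }

  getEmbedding(query: string): number[] | undefined {
    return this.embeddings.get(this.key(query));
  }

  setEmbedding(query: string, embedding: number[]): void {
    this.embeddings.set(this.key(query), embedding);
  }

  getAnswer(query: string): string | undefined {
    return this.answers.get(this.key(query));
  }

  setAnswer(query: string, answer: string): void {
    this.answers.set(this.key(query), answer);
  }

  invalidate(): void {
    this.version++; // old keys become unreachable
  }
}
```

Normalizing the key means "What's your refund policy?" and "what's your refund policy?" share one cached embedding and one cached answer.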
Next steps
You now have a working RAG pipeline that you understand end to end. Every piece is visible, every decision is yours. From here, the natural progressions are:
- Swap the vector store for Chroma or pgvector and test with a larger document corpus — our sample documents are tiny, and real-world performance depends on scale
- Add a reranker — retrieve top-20 with vector search, rerank with a cross-encoder model, keep top-3. This dramatically improves precision when your initial retrieval is noisy
- Implement hybrid search — combine vector similarity with keyword matching (BM25). Vector search handles semantic similarity; BM25 handles exact keyword matches. Together they catch cases that either misses alone
- Add metadata filtering — tag chunks with source, date, category, and filter during retrieval. A user asking about "2026 pricing" shouldn't get chunks from your 2024 pricing page
- Try query expansion — rewrite the user's query into multiple forms before searching. "How do I get a refund?" could also search for "return policy" and "money back guarantee"
- Build a proper eval suite — create a test set of question-answer pairs and run them against your pipeline automatically. Track retrieval precision and answer quality over time so you catch regressions before users do
Each of these improvements is independent — you can add a reranker without changing your chunking, or switch vector databases without touching your generation code. That's the advantage of understanding each piece: you know exactly which component to upgrade when you hit a specific limitation.
RAG is the foundation that makes AI agents actually useful with your data. Whether you're building a customer support agent, an internal knowledge assistant, or any system that needs to answer questions from a specific corpus, the pattern is the same: chunk, embed, retrieve, generate. Everything else is optimization on top of these four operations, and now you understand exactly what each one does and why.