A startup I worked with last year spent six weeks fine-tuning GPT-4o-mini on their customer support transcripts. The model learned their tone perfectly — friendly, concise, on-brand. One problem: their product docs changed every two weeks. The fine-tuned model kept recommending deprecated features, quoting stale pricing, and referencing a returns policy that no longer existed. They needed RAG. They built fine-tuning.
Around the same time, a different team built a RAG pipeline over their internal compliance handbook. Retrieval worked great — it found the right sections every time. But the model's answers were verbose, used the wrong terminology, and formatted responses in a way that confused their legal team. The knowledge was right; the behavior was wrong. They needed fine-tuning. They built RAG.
Both teams picked their approach before understanding what problem they were solving. Fine-tuning and RAG address fundamentally different things, and treating them as interchangeable is the most expensive mistake in applied LLM development.
I've watched this play out across dozens of teams. The confusion makes sense — both techniques make outputs "better," and the marketing around each promises to solve everything. But "better" in what dimension? That's the question nobody asks before they start building.
This guide builds both approaches on the same customer support dataset so you can see the difference firsthand. You'll implement a RAG pipeline, fine-tune a model with LoRA, run them head-to-head on the same twenty questions, and learn exactly when to use which — and when you need both.
| Approach | What it changes | Best for | Cost | Latency |
|---|---|---|---|---|
| RAG | What the model knows (at query time) | Factual accuracy, source citation, changing data | Low setup, higher per-query | Higher (retrieval + generation) |
| Fine-tuning | How the model behaves (permanently) | Tone, style, format, domain reasoning | Medium setup, lower per-query | Lower (no retrieval step) |
| Hybrid | Both knowledge and behavior | Production agents that need both accuracy and consistency | Highest setup, lowest per-query at scale | Medium |
What you'll need
Accounts:
- A Chanl account (free tier works for all RAG examples)
- An OpenAI account (required for fine-tuning and embeddings)
- An Anthropic account (optional, if you want to test alternative models)
Install the SDK:
npm install @chanl-ai/sdk openai
Set your API keys:
export CHANL_API_KEY="your-chanl-api-key"
export OPENAI_API_KEY="your-openai-api-key"
# Optional:
export ANTHROPIC_API_KEY="your-anthropic-api-key"
RAG adds knowledge, fine-tuning changes behavior
RAG gives the model access to information it doesn't have. Fine-tuning changes how the model processes and responds to information. They operate on different axes entirely.
Here's the mental model that makes it click:
RAG is like handing someone a reference book before they answer your question. They still think the same way, speak the same way, structure their answers the same way — but now they have the right information in front of them.
Fine-tuning is like months of job training. The person's personality shifts. They start using industry jargon naturally. They develop instincts about how to format reports. They automatically know when to escalate. But they can still be wrong about facts they haven't been taught — and they can't "unlearn" outdated information without more training.
The confusion happens because both improve answer quality — but they improve different dimensions. RAG improves what the model says (factual content). Fine-tuning improves how it says it (style, format, reasoning patterns).
RAG is additive and reversible — swap out your document corpus anytime without touching the model. Update your pricing page? The next query picks up the new numbers immediately. Fine-tuning is baked in. If your fine-tuned model learned that the Gold tier costs $14.99/month, it'll keep saying $14.99 until you retrain — even if the price changed six months ago.
Conversely, RAG can't teach a model to be more concise, to lead with empathy when delivering bad news, or to structure responses with bullet points instead of paragraphs. Those are behavioral patterns. You can try prompt engineering, but it has limits. When the behavior you want is too nuanced for instructions alone, fine-tuning earns its keep.
If you've already built a RAG pipeline from scratch, you've seen half of this picture. Now for the other half.
The decision framework: 5 questions
Answer these five questions before writing code. They'll tell you which approach to start with — and whether you'll eventually need both.
| Question | If yes → RAG | If yes → fine-tuning |
|---|---|---|
| Does your data change frequently? (weekly or faster) | Your documents update regularly; the model needs current info | Your data is stable; retraining quarterly is acceptable |
| Do you need source attribution? ("According to Section 4.2...") | Users need to verify answers against original documents | Users trust the output without citations |
| Is the problem mainly about tone/style/format? | Your base model already writes well enough | Outputs need consistent brand voice, specific formatting, or domain jargon |
| Do you have 100+ high-quality training examples? | You don't have labeled data or can't create it | You have curated input/output pairs that demonstrate ideal behavior |
| Is latency critical? (sub-500ms responses) | You can tolerate 500ms-2s for retrieval + generation | Every millisecond matters; no room for a retrieval step |
The decision tree in practice:
- Start with RAG if your primary problem is "the model doesn't know X." This covers 70% of real-world use cases: support docs, product info, policy lookups, knowledge bases.
- Start with fine-tuning if your primary problem is "the model doesn't sound or act like we need it to." This covers style guides, classification, structured output, and domain-specific reasoning.
- Plan for hybrid if you need both accurate facts and consistent behavior. Most production agents end up here eventually.
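The five questions can be sketched as a small scoring function. The field names and the signal-counting heuristic are my own shorthand for the questions above, not an established formula — treat it as a starting point, not a verdict:

```typescript
// Sketch of the five-question framework as code. The thresholds are a rough
// heuristic, assumed for illustration: any strong RAG signal plus two or more
// fine-tuning signals suggests hybrid.
interface ProblemProfile {
  dataChangesWeekly: boolean;   // Q1: does your data change weekly or faster?
  needsCitations: boolean;      // Q2: do users need source attribution?
  mainlyToneOrFormat: boolean;  // Q3: is the problem mainly tone/style/format?
  hasCuratedExamples: boolean;  // Q4: do you have 100+ quality examples?
  latencyCritical: boolean;     // Q5: sub-500ms required?
}

type Approach = "rag" | "fine-tuning" | "hybrid";

function chooseApproach(p: ProblemProfile): Approach {
  const ragSignals =
    (p.dataChangesWeekly ? 1 : 0) + (p.needsCitations ? 1 : 0);
  const ftSignals =
    (p.mainlyToneOrFormat ? 1 : 0) +
    (p.hasCuratedExamples ? 1 : 0) +
    (p.latencyCritical ? 1 : 0);
  if (ragSignals > 0 && ftSignals >= 2) return "hybrid";
  return ftSignals > ragSignals ? "fine-tuning" : "rag";
}
```

A support-docs bot with weekly doc updates and no training data lands on "rag"; a brand-voice rewriter with curated examples and stable data lands on "fine-tuning"; check both boxes and you're in hybrid territory.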
A real example: A healthcare company's support agent needs to answer insurance coverage questions (RAG — policies change quarterly) in a specific clinical tone that avoids liability language (fine-tuning — the behavior must be consistent). Neither approach alone solves both.
There's an important nuance on data requirements. People often hear "hundreds of thousands of examples" for fine-tuning. That was true for full fine-tuning three years ago. With LoRA, 50-200 high-quality examples are enough. The quality bar is higher — every example must demonstrate exactly the behavior you want — but the quantity bar is much lower than most people expect.
The answer to "RAG or fine-tuning?" also changes over time. A team that starts with RAG might discover through evaluation that accuracy scores are excellent but tone scores are mediocre. That's the signal to add fine-tuning as a complement, not a replacement.
Build along: RAG on a customer support dataset
We'll build both approaches against the same dataset so the comparison is apples-to-apples. The dataset is deliberately small (five documents) so you can run everything locally and see results in seconds. In production, the same pipeline scales to thousands of documents with a proper vector store.
Here's the flow. Each customer question goes through embedding, vector search, context assembly, and generation:
RAG pipeline
import OpenAI from "openai";
const openai = new OpenAI();
// -- Support knowledge base (same data used for fine-tuning later) --
const SUPPORT_DOCS = [
{
source: "returns-policy.md",
content: `## Returns Policy (Effective January 2026)
Items may be returned within 30 days of purchase for a full refund.
Electronics have a 15-day return window. Gift cards are final sale.
Refunds are processed within 5-7 business days to the original
payment method. Restocking fees of 15% apply to opened electronics.
Defective items are exempt from restocking fees and may be returned
within 90 days.`,
},
{
source: "shipping-guide.md",
content: `## Shipping Options
Standard shipping: 5-7 business days, free on orders over $50.
Express shipping: 2-3 business days, $12.99 flat rate.
Overnight shipping: next business day, $24.99 flat rate.
International shipping: 10-21 business days, calculated at checkout.
All orders ship from our Newark, NJ fulfillment center.
Tracking numbers are emailed within 24 hours of shipment.`,
},
{
source: "product-warranty.md",
content: `## Product Warranty
All electronics carry a 1-year manufacturer warranty covering
defects in materials and workmanship. Extended warranties are
available for purchase within 30 days of the original order.
The 2-year extended warranty costs $29.99 and covers accidental
damage. Warranty claims require the original receipt or order
confirmation email. Replacements ship within 3-5 business days.`,
},
{
source: "account-management.md",
content: `## Account Management
Password reset: click "Forgot Password" on the login page.
A reset link is sent to your registered email within 5 minutes.
Two-factor authentication is available under Security Settings.
Account deletion requests are processed within 14 business days.
Data export is available under Privacy Settings and includes
order history, saved addresses, and payment method details
(last 4 digits only).`,
},
{
source: "pricing-tiers.md",
content: `## Membership Tiers (Updated March 2026)
Free tier: standard shipping rates, no member discounts.
Silver ($9.99/mo): free standard shipping, 5% discount on all items.
Gold ($19.99/mo): free express shipping, 10% discount, early access to sales.
Platinum ($39.99/mo): free overnight shipping, 15% discount, priority support,
dedicated account manager. Annual billing saves 20% on all tiers.`,
},
];
// -- Chunking --
interface Chunk {
text: string;
source: string;
index: number;
}
function chunkDocuments(
docs: typeof SUPPORT_DOCS,
maxSize = 400,
overlap = 50
): Chunk[] {
const chunks: Chunk[] = [];
for (const doc of docs) {
const paragraphs = doc.content.split("\n\n");
let current = "";
for (const para of paragraphs) {
if ((current + "\n\n" + para).length > maxSize && current) {
chunks.push({
text: current.trim(),
source: doc.source,
index: chunks.length,
});
// Keep overlap from end of previous chunk
const words = current.split(" ");
current =
words.slice(Math.max(0, words.length - Math.ceil(overlap / 5))).join(" ") +
"\n\n" +
para;
} else {
current = current ? current + "\n\n" + para : para;
}
}
if (current.trim()) {
chunks.push({
text: current.trim(),
source: doc.source,
index: chunks.length,
});
}
}
return chunks;
}
// -- Embedding --
async function embedTexts(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return response.data
.sort((a, b) => a.index - b.index)
.map((d) => d.embedding);
}
// -- Vector search --
function cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, normA = 0, normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
interface StoredChunk {
chunk: Chunk;
embedding: number[];
}
async function buildIndex(chunks: Chunk[]): Promise<StoredChunk[]> {
const embeddings = await embedTexts(chunks.map((c) => c.text));
return chunks.map((chunk, i) => ({ chunk, embedding: embeddings[i] }));
}
async function retrieve(
query: string,
index: StoredChunk[],
topK = 3
): Promise<{ chunk: Chunk; score: number }[]> {
const [queryEmbedding] = await embedTexts([query]);
return index
.map((item) => ({
chunk: item.chunk,
score: cosineSimilarity(queryEmbedding, item.embedding),
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
// -- Generation --
async function generateWithRAG(
query: string,
retrievedChunks: { chunk: Chunk; score: number }[]
): Promise<string> {
const context = retrievedChunks
.map(
(r, i) =>
`[Source ${i + 1}: ${r.chunk.source}]\n${r.chunk.text}`
)
.join("\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0.3,
messages: [
{
role: "system",
content: `You are a helpful customer support agent. Answer questions using ONLY the provided context documents. If the context doesn't contain enough information, say so. Be concise, friendly, and cite your sources with [Source N].`,
},
{
role: "user",
content: `Context:\n${context}\n\nCustomer question: ${query}`,
},
],
});
return response.choices[0].message.content ?? "";
}
// -- Main --
async function runRAGPipeline(queries: string[]) {
const chunks = chunkDocuments(SUPPORT_DOCS);
console.log(`Indexed ${chunks.length} chunks from ${SUPPORT_DOCS.length} documents\n`);
const index = await buildIndex(chunks);
const results: { query: string; answer: string }[] = [];
for (const query of queries) {
const retrieved = await retrieve(query, index);
const answer = await generateWithRAG(query, retrieved);
results.push({ query, answer });
console.log(`Q: ${query}`);
console.log(`A: ${answer}\n`);
}
return results;
}
Notice the design decisions here. We're using paragraph-based chunking with a 400-character limit — small enough for precise retrieval, large enough to preserve meaning. Cosine-similarity retrieval returns the top 3 chunks, and the system prompt constrains the model to answer only from the provided context. Every one of these is a knob you'll want to tune for your data.
Now here's the same thing with Chanl's knowledge base API, which handles chunking, embedding, storage, and hybrid search for you:
import { ChanlClient } from '@chanl-ai/sdk';
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY!,
agentId: 'agent_xxx',
providers: {
openai: { apiKey: process.env.OPENAI_API_KEY! },
},
});
// Create a knowledge base — chunking and embedding config included
const kb = await chanl.knowledge.create({
name: 'Support Documentation',
description: 'Product docs, policies, and pricing',
config: {
chunkSize: 512,
chunkOverlap: 50,
embeddingModel: 'text-embedding-3-small',
},
});
// Upload documents — handles chunking, embedding, and indexing automatically
await chanl.knowledge.upload(kb.id, {
type: 'file',
file: supportDocsBuffer,
filename: 'support-docs.pdf',
});
// Search with hybrid retrieval (vector + keyword)
const results = await chanl.knowledge.search(kb.id, {
query: 'What is the return window for electronics?',
mode: 'hybrid',
limit: 5,
synthesize: true, // LLM-synthesized answer from retrieved chunks
});
console.log(results.answer);
// "Electronics have a 15-day return window. A 15% restocking fee
// applies to opened electronics, but defective items are exempt
// and can be returned within 90 days."
console.log(results.sources);
// [{ document: "returns-policy.md", chunk: "...", score: 0.94 }]
The from-scratch pipeline is about 150 lines of code, plus embedding costs, plus a vector store to manage. The SDK version is about ten lines. Both produce grounded answers from your documents. The difference is operational: when you have fifty knowledge bases across twenty agents, the manual approach becomes a maintenance burden. That's what knowledge base infrastructure is for.
Build along: LoRA fine-tuning on the same dataset
Now the fine-tuning side. The goal isn't to teach the model new facts — it's to teach it how to respond: the right tone, format, and level of detail.
This distinction trips people up. The training examples below contain facts — return windows, warranty details, pricing. The model will absorb some of those facts during training. But that's a side effect, not the goal. The goal is behavioral: lead with empathy when delivering bad news, include specific numbers, end with a clear next step, keep responses to two or three sentences. Those patterns are hard to achieve with prompt engineering alone, especially when they need to be consistent across thousands of conversations.
Step 1: Prepare training data
Fine-tuning requires input/output pairs — "when a customer asks X, respond exactly like Y." Quality matters far more than quantity. Fifty excellent examples beat five hundred mediocre ones.
The most common mistake here is using raw conversation transcripts. Real conversations are messy: agents say "um," make typos, give inconsistent answers. Instead, take your best conversations, clean them up, and make sure every response demonstrates exactly the behavior you want. Think of each example as a specification, not a transcript.
// training-data.ts — Prepare JSONL for OpenAI fine-tuning
interface TrainingExample {
messages: {
role: "system" | "user" | "assistant";
content: string;
}[];
}
const SYSTEM_PROMPT = `You are a support agent for TechStore. Follow these rules:
1. Be concise — 2-3 sentences max unless the question requires detail
2. Use a warm but professional tone
3. Always include the specific policy detail (numbers, timeframes)
4. If you need to deliver bad news, lead with empathy
5. End with a clear next step when applicable`;
const trainingExamples: TrainingExample[] = [
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "Can I return the headphones I bought last week?" },
{
role: "assistant",
content:
"Absolutely! Since you're within the 30-day return window, you can return your headphones for a full refund. Just bring them to any store location or start a return from your order history page — refunds process in 5-7 business days.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "I want to return a laptop I bought 20 days ago. I already opened it." },
{
role: "assistant",
content:
"You're within the 15-day electronics return window, so you can return the laptop. Since it's been opened, a 15% restocking fee applies. If it's defective, that fee is waived — just let me know and I'll process it as a warranty return instead.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "My monitor stopped working after 3 months. What can I do?" },
{
role: "assistant",
content:
"I'm sorry to hear that! Your monitor is covered under our 1-year manufacturer warranty. I can start a warranty claim for you right now — I'll just need your order confirmation email or receipt number. Replacements typically ship within 3-5 business days.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "How much is the Gold membership?" },
{
role: "assistant",
content:
"Gold membership is $19.99/month and includes free express shipping, a 10% discount on everything, and early access to sales. If you go with annual billing, you'll save 20% — bringing it down to about $16/month.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "I want to delete my account." },
{
role: "assistant",
content:
"I understand. Account deletion requests are processed within 14 business days. Before we proceed, you can export your data (order history, saved addresses) from Privacy Settings. Want me to start the deletion process, or would you like to export first?",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "Can I return a gift card?" },
{
role: "assistant",
content:
"I wish I had better news — gift cards are final sale and can't be returned or refunded. If there's an issue with the card (damaged, not activating), I'd be happy to help troubleshoot that instead.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "When will my order arrive? I chose express shipping." },
{
role: "assistant",
content:
"Express orders arrive within 2-3 business days. Your tracking number should have been emailed within 24 hours of shipment — check your inbox (and spam folder). If you don't see it, share your order number and I'll look it up for you.",
},
],
},
{
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: "What's the difference between Silver and Platinum membership?" },
{
role: "assistant",
content:
"Silver ($9.99/mo) gets you free standard shipping and 5% off everything. Platinum ($39.99/mo) steps it up significantly: free overnight shipping, 15% off, priority support, and a dedicated account manager. If you ship frequently or spend over $200/month, Platinum usually pays for itself.",
},
],
},
// ... in production you'd have 50-200 examples
];
// Write JSONL file for OpenAI fine-tuning API
import * as fs from "fs";
const jsonl = trainingExamples
.map((ex) => JSON.stringify({ messages: ex.messages }))
.join("\n");
fs.writeFileSync("training-data.jsonl", jsonl);
console.log(`Wrote ${trainingExamples.length} training examples to training-data.jsonl`);
Step 2: Upload and fine-tune via OpenAI
With your JSONL ready, upload it and kick off the job. For 50-200 examples, expect about thirty minutes of training time.
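Before uploading, it's worth sanity-checking the file — malformed lines fail the job after you've already waited in the queue. Here's a minimal pre-flight validator (a sketch of my own, not OpenAI's official checker) that catches the most common failure modes in chat-format training data:

```typescript
// Minimal JSONL pre-flight check for chat-format fine-tuning data (a sketch,
// not an official validator). Flags invalid JSON, empty message content, and
// examples that don't end with an assistant turn.
interface ChatMessage {
  role: string;
  content: string;
}

function validateJsonl(jsonl: string): string[] {
  const errors: string[] = [];
  jsonl.split("\n").forEach((line, i) => {
    if (!line.trim()) return; // skip blank lines
    let parsed: { messages?: ChatMessage[] };
    try {
      parsed = JSON.parse(line);
    } catch {
      errors.push(`line ${i + 1}: invalid JSON`);
      return;
    }
    const msgs = parsed.messages ?? [];
    if (msgs.some((m) => !m.content?.trim())) {
      errors.push(`line ${i + 1}: empty message content`);
    }
    if (msgs[msgs.length - 1]?.role !== "assistant") {
      errors.push(`line ${i + 1}: example must end with an assistant message`);
    }
  });
  return errors;
}
```

Run it against `training-data.jsonl` before the upload step; an empty error list means the structural basics are in place (it says nothing about example quality, which is on you).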
import OpenAI from "openai";
import * as fs from "fs";
const openai = new OpenAI();
async function fineTune() {
// Upload training file
const file = await openai.files.create({
file: fs.createReadStream("training-data.jsonl"),
purpose: "fine-tune",
});
console.log(`Uploaded file: ${file.id}`);
// Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3, // 3 epochs is a good default
learning_rate_multiplier: 1.8, // Slightly above default for small datasets
},
suffix: "support-agent", // Model name suffix for identification
});
console.log(`Fine-tuning job created: ${job.id}`);
console.log(`Status: ${job.status}`);
// Poll for completion
let currentJob = job;
  while (!["succeeded", "failed", "cancelled"].includes(currentJob.status)) {
await new Promise((r) => setTimeout(r, 30_000));
currentJob = await openai.fineTuning.jobs.retrieve(job.id);
console.log(`Status: ${currentJob.status}`);
}
if (currentJob.status === "succeeded") {
console.log(`\nFine-tuned model: ${currentJob.fine_tuned_model}`);
// e.g., ft:gpt-4o-mini-2024-07-18:chanl:support-agent:abc123
} else {
console.error("Fine-tuning failed:", currentJob.error);
}
return currentJob;
}
What about LoRA and QLoRA?
When you fine-tune through OpenAI's API, they handle the infrastructure — you upload data, they return a model ID. But if you need more control or want to fine-tune open-source models like Llama 3 or Mistral, you'll use LoRA or QLoRA directly. Understanding these techniques matters even if you use a managed API, because they explain why fine-tuning became accessible in the first place.
LoRA (Low-Rank Adaptation) is what made fine-tuning practical for small teams. The core insight from the 2021 paper by Hu et al.: during fine-tuning, weight updates tend to be low-rank — they can be approximated by the product of two small matrices instead of one large one. LoRA freezes the original model weights and trains only small adapter matrices. Instead of updating all 7 billion parameters, you train maybe 10 million — about 0.1% of the total.
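The arithmetic behind that "0.1%" is easy to check. For each adapted d×d weight matrix, LoRA trains two small matrices of shape d×r and r×d. A back-of-envelope count, using assumed Llama-7B-class shapes (32 layers, 4 attention projections per layer, hidden size 4096):

```typescript
// Back-of-envelope trainable-parameter count for LoRA. Real totals depend on
// which modules get adapters; square projection matrices are assumed here.
function loraTrainableParams(
  layers: number,           // transformer layers with adapters
  matricesPerLayer: number, // e.g. 4 attention projections (q, k, v, o)
  dModel: number,           // hidden size
  rank: number              // LoRA rank r
): number {
  // Each adapted matrix gets A (dModel x r) and B (r x dModel)
  const perMatrix = dModel * rank + rank * dModel;
  return layers * matricesPerLayer * perMatrix;
}

// 32 layers, 4 projections, d_model = 4096, r = 8
const trainable = loraTrainableParams(32, 4, 4096, 8);
console.log(`${(trainable / 1e6).toFixed(1)}M trainable params`); // "8.4M trainable params"
```

About 8.4 million trainable parameters against a 7-billion-parameter base — roughly 0.1%, which is where the headline number comes from.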
The practical impact: a full fine-tune of a 7B model requires multiple A100 GPUs with 80GB VRAM each, costs hundreds of dollars, and takes days. LoRA on the same model needs a single GPU with 16-24GB, costs under $50, and finishes in hours.
QLoRA goes further by loading the base model in 4-bit quantized format, then applying LoRA on top. This cuts memory requirements by another 4x — a 7B model that normally needs 28GB can be fine-tuned with under 6GB. You can fine-tune Llama 3 on a laptop with a consumer GPU. The quality trade-off is real but often acceptable: QLoRA models typically hit about 90% of full fine-tuning quality.
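Those memory figures are weights-only estimates, but the arithmetic is simple enough to verify (a rough sketch — real training also needs activations, gradients, and optimizer state on top of the weights):

```typescript
// Weights-only memory estimate at different precisions. Overhead for
// activations, gradients, and optimizer state is deliberately excluded.
function weightMemoryGB(params: number, bitsPerParam: number): number {
  return (params * bitsPerParam) / 8 / 1e9; // bits -> bytes -> GB (decimal)
}

const sevenB = 7e9;
console.log(weightMemoryGB(sevenB, 32).toFixed(1)); // fp32:  "28.0"
console.log(weightMemoryGB(sevenB, 16).toFixed(1)); // fp16:  "14.0"
console.log(weightMemoryGB(sevenB, 4).toFixed(1));  // 4-bit: "3.5"
```

Quantizing the frozen base to 4 bits drops the weight footprint from 28GB (fp32) to 3.5GB, which is why QLoRA fits a 7B model under 6GB once adapter and activation overhead is added back in.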
| Method | GPU memory | Training time | Model quality | Cost |
|---|---|---|---|---|
| Full fine-tuning | 100+ GB (multi-GPU) | Days | Best possible | $$$$ |
| LoRA | 16-24 GB (single GPU) | Hours | ~95% of full | $$ |
| QLoRA | 6-12 GB (consumer GPU) | Hours | ~90% of full | $ |
| OpenAI API | None (managed) | Minutes-hours | Good | $$ |
For most production use cases, OpenAI API fine-tuning or LoRA is sufficient. QLoRA matters when you need to keep data on-premises (healthcare, finance, government), avoid vendor lock-in with open-source models, or work within tight hardware constraints. If you're using a managed API, the provider handles LoRA/QLoRA internally.
One caveat: LoRA adapters are specific to the base model they were trained on. An adapter for Llama 3.1 8B doesn't work with Llama 3.1 70B or Mistral 7B. Retraining takes hours, not days, but it's worth knowing before you commit to a model family.
Once you have a fine-tuned model — from OpenAI's API or your own LoRA training — configuring it in Chanl is straightforward:
import { ChanlClient } from '@chanl-ai/sdk';
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY!,
agentId: 'agent_xxx',
providers: {
openai: { apiKey: process.env.OPENAI_API_KEY! },
},
});
// Create an agent that uses your fine-tuned model
const fineTunedAgent = await chanl.agents.create({
name: 'Support Agent (Fine-Tuned)',
configuration: {
llm: {
provider: 'openai',
model: 'ft:gpt-4o-mini-2024-07-18:chanl:support-agent:abc123',
temperature: 0.3,
},
},
promptId: supportPrompt.id,
});
The agent now uses your fine-tuned model for all conversations — tone and format baked into the weights.
Head-to-head: same 20 questions, three approaches
Theory is useful; data is conclusive. We'll run the same twenty questions through three configurations — RAG-only, fine-tuned-only, and hybrid — then score them with an LLM-as-judge evaluation framework.
The test set is deliberately varied: five pure factual recall questions, five that need both knowledge and nuance, five focused on tone and empathy (angry customers, frustrated users), and five edge cases that test policy reasoning the model hasn't seen exact examples for.
The test set
const TEST_QUESTIONS = [
// Factual recall — RAG should excel
"What's the return window for electronics?",
"How much does express shipping cost?",
"What does the extended warranty cover?",
"How long do account deletion requests take?",
"What discount does Gold membership give?",
// Nuanced policy — needs both knowledge and tone
"I bought a laptop 20 days ago and it's defective. Can I return it?",
"I want to return a gift card my mom gave me.",
"My keyboard broke after 8 months. Is it still under warranty?",
"Can I get a refund on my Platinum membership? I've had it for 3 months.",
"I ordered something last week and it hasn't shipped yet.",
// Tone and empathy — fine-tuning should excel
"This is the third time my order has been wrong. I'm done with you guys.",
"I'm really upset — I was charged twice for the same item.",
"Your website is terrible. I can't find anything.",
"I've been on hold for 45 minutes. This is unacceptable.",
"I'm elderly and I don't understand how to use your app.",
// Edge cases — tests both approaches
"Can I return an opened laptop that I bought 14 days ago?",
"What happens if my warranty claim is denied?",
"I want Silver membership but I also want express shipping. What are my options?",
"Do you price match? I found the same item cheaper on Amazon.",
"Can I combine my Gold discount with a sale price?",
];
Building the eval harness
To compare objectively, we need an automated judge. This harness uses GPT-4o as an LLM-as-judge, scoring each response on four dimensions with clear rubrics:
import OpenAI from "openai";
const openai = new OpenAI();
interface EvalScore {
accuracy: number; // 1-5: factual correctness
tone: number; // 1-5: warmth, empathy, professionalism
format: number; // 1-5: conciseness, structure, clarity
completeness: number; // 1-5: addresses all parts of the question
}
async function judgeResponse(
question: string,
response: string,
referenceContext: string
): Promise<EvalScore> {
const judgeResponse = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
response_format: { type: "json_object" },
messages: [
{
role: "system",
content: `You are an expert evaluator of customer support AI responses.
Score the following response on four dimensions (1-5 each):
ACCURACY (1-5): Is the information factually correct based on the reference context?
5 = perfectly accurate, all facts verified
3 = mostly accurate with minor errors
1 = major factual errors or hallucinations
TONE (1-5): Is the response warm, empathetic, and professional?
5 = excellent empathy, feels human and caring
3 = adequate but robotic
1 = cold, dismissive, or inappropriate
FORMAT (1-5): Is the response concise, well-structured, and clear?
5 = perfect length, easy to follow
3 = acceptable but could be shorter or better organized
1 = wall of text, confusing, or too terse
COMPLETENESS (1-5): Does it address all parts of the question?
5 = fully addresses the question with actionable next steps
3 = partially addresses it
1 = misses the main point
Respond as JSON: {"accuracy": N, "tone": N, "format": N, "completeness": N, "reasoning": "brief explanation"}`,
},
{
role: "user",
content: `Reference Context:\n${referenceContext}\n\nCustomer Question: ${question}\n\nAgent Response: ${response}`,
},
],
});
return JSON.parse(judgeResponse.choices[0].message.content ?? "{}");
}
// Run all three variants
async function runComparison(questions: string[]) {
const ragResults = await runRAGPipeline(questions);
const ftResults = await runFineTunedPipeline(questions);
const hybridResults = await runHybridPipeline(questions);
// Score each
const referenceContext = SUPPORT_DOCS.map((d) => d.content).join("\n\n");
for (let i = 0; i < questions.length; i++) {
const ragScore = await judgeResponse(questions[i], ragResults[i].answer, referenceContext);
const ftScore = await judgeResponse(questions[i], ftResults[i].answer, referenceContext);
const hybridScore = await judgeResponse(questions[i], hybridResults[i].answer, referenceContext);
console.log(`\nQ: ${questions[i]}`);
console.log(` RAG: A=${ragScore.accuracy} T=${ragScore.tone} F=${ragScore.format} C=${ragScore.completeness}`);
console.log(` Fine-tuned: A=${ftScore.accuracy} T=${ftScore.tone} F=${ftScore.format} C=${ftScore.completeness}`);
console.log(` Hybrid: A=${hybridScore.accuracy} T=${hybridScore.tone} F=${hybridScore.format} C=${hybridScore.completeness}`);
}
}
That works, but there's a lot of code to build and maintain — and it only covers scoring. You still need to manage test scenarios, track results over time, and compare runs across model versions. Here's the same comparison using Chanl's scorecard and scenario testing system:
import { ChanlClient } from '@chanl-ai/sdk';
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY!,
agentId: 'agent_xxx',
providers: {
openai: { apiKey: process.env.OPENAI_API_KEY! },
},
});
// Create evaluation scorecard with weighted criteria
const scorecard = await chanl.scorecards.create({
name: 'Support Quality Eval',
criteria: [
{
name: 'Factual Accuracy',
weight: 0.4,
description: 'Answer contains correct, verifiable information from the knowledge base',
},
{
name: 'Tone Consistency',
weight: 0.2,
description: 'Matches brand voice: warm, professional, empathetic when delivering bad news',
},
{
name: 'Format Compliance',
weight: 0.2,
description: 'Concise (2-3 sentences), well-structured, includes specific details',
},
{
name: 'Completeness',
weight: 0.2,
description: 'Addresses all parts of the question with actionable next steps',
},
],
});
// Run head-to-head comparison across all three variants
const comparison = await chanl.comparisons.create({
name: 'RAG vs Fine-Tuned vs Hybrid',
variants: [
{ agentId: ragAgent.id, label: 'RAG-only' },
{ agentId: fineTunedAgent.id, label: 'Fine-tuned' },
{ agentId: hybridAgent.id, label: 'Hybrid' },
],
scenarios: [supportScenarioId],
scorecardId: scorecard.id,
});
// Get results — scored automatically by the scorecard
const results = await chanl.comparisons.getResults(comparison.id);
console.table(
results.variants.map((v) => ({
Variant: v.label,
'Avg Score': v.avgScore.toFixed(2),
Accuracy: v.criteriaBreakdown['Factual Accuracy'].toFixed(2),
Tone: v.criteriaBreakdown['Tone Consistency'].toFixed(2),
'Cost/query': `$${v.avgCostPerQuery.toFixed(4)}`,
'Avg Latency': `${v.avgLatencyMs}ms`,
}))
);

What the results actually look like
When you run this on real customer support data, the pattern is consistent:
| Variant | Avg score | Accuracy | Tone | Format | Completeness | Cost/query | Latency |
|---|---|---|---|---|---|---|---|
| RAG-only | 3.85 | 4.60 | 3.20 | 3.40 | 4.20 | $0.0035 | 1,200ms |
| Fine-tuned | 3.70 | 3.10 | 4.50 | 4.40 | 3.30 | $0.0012 | 400ms |
| Hybrid | 4.35 | 4.40 | 4.30 | 4.20 | 4.50 | $0.0038 | 1,100ms |
Neither RAG nor fine-tuning dominates across the board. Each wins on the dimensions it was designed to optimize.
RAG dominates accuracy. When the answer exists in the knowledge base, RAG finds it. The fine-tuned model sometimes hallucinates or quotes outdated information. On "What's the return window for electronics?", RAG answers "15 days" every time. The fine-tuned model occasionally says "30 days" because the general return policy is more prominent in its training data. This is fine-tuning's fundamental weakness for factual knowledge: the model can't distinguish "I was taught this" from "this is currently true."
Fine-tuning dominates tone and format. When a customer says "I'm done with you guys," the fine-tuned model leads with empathy naturally — acknowledging frustration, apologizing, offering a concrete resolution. The RAG model's responses to emotionally charged questions are accurate but clinical. The facts are right, but the delivery reads like someone pasted a help doc paragraph into a chat window. RAG retrieves policy documents, not examples of how to handle anger.
The hybrid wins overall. Facts from RAG, delivery from fine-tuning. Look at the completeness score: 4.50 for hybrid vs 4.20 for RAG and 3.30 for fine-tuned. The hybrid doesn't just answer the question — it answers it well, in the right voice, with actionable next steps.
Fine-tuning is 3x faster. No retrieval step means no embedding, no vector search, no context assembly. That 800ms difference matters for voice agents — our deep dive on why latency kills satisfaction shows every additional second of silence costs 16% in customer satisfaction. For text chat, the latency gap is less critical.
The cost difference is noise. At $0.0035 per query for RAG versus $0.0012 for fine-tuned, even at 100,000 queries/month the difference is $230. Don't optimize there. Optimize for quality — the hybrid approach's lower escalation rate to human agents usually saves more than the per-query cost difference.
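That arithmetic is quick to sanity-check. A throwaway sketch using the per-query figures from the results table above:

```typescript
// Per-query costs from the comparison table, and a typical monthly volume
const ragCostPerQuery = 0.0035;
const ftCostPerQuery = 0.0012;
const queriesPerMonth = 100_000;

// Monthly delta from choosing fine-tuned over RAG purely on cost
const monthlyDelta = (ragCostPerQuery - ftCostPerQuery) * queriesPerMonth;
console.log(`$${monthlyDelta.toFixed(0)} per month`); // → "$230 per month"
```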
The hybrid playbook
The hybrid approach — fine-tuned model plus RAG retrieval — is where most production agents should land. Not because it's the fanciest option, but because real-world agents inevitably need both accurate facts and consistent behavior.
The core idea: use a fine-tuned model as your LLM (right tone, format, domain instincts), but attach a knowledge base for current facts at query time. The fine-tuned weights handle how to respond. RAG context handles what to say. They're complementary layers, each covering the other's weakness.
import { ChanlClient } from '@chanl-ai/sdk';
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY!,
agentId: 'agent_xxx',
providers: {
openai: { apiKey: process.env.OPENAI_API_KEY! },
},
});
// Create hybrid agent: fine-tuned model + knowledge base
const hybridAgent = await chanl.agents.create({
name: 'Support Agent (Hybrid)',
configuration: {
llm: {
provider: 'openai',
model: 'ft:gpt-4o-mini-2024-07-18:chanl:support-agent:abc123',
temperature: 0.3,
},
},
knowledgeBaseIds: [kb.id], // RAG retrieval on fine-tuned model
promptId: hybridPrompt.id,
});

That's it. When a customer asks a question, the platform retrieves relevant docs, injects them into context, and the fine-tuned model generates a response in your brand voice grounded in current facts.
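If you're curious what that retrieve-inject-generate loop looks like without a platform, it reduces to assembling a grounded message list and sending it to the fine-tuned model. A minimal sketch — `buildGroundedMessages` is a hypothetical helper, and retrieval is stubbed since any vector store works here:

```typescript
type ChatMessage = { role: 'system' | 'user'; content: string };

// Inject retrieved docs as grounding context; the fine-tuned weights
// already handle tone and format, so the system prompt stays short.
function buildGroundedMessages(question: string, docs: string[]): ChatMessage[] {
  const context = docs.join('\n---\n');
  return [
    {
      role: 'system',
      content:
        'You are a TechStore support agent. Use the context below as your ' +
        `primary source of truth.\n\nContext:\n${context}`,
    },
    { role: 'user', content: question },
  ];
}

// These messages then go to the fine-tuned model, e.g. via the OpenAI SDK:
//   await openai.chat.completions.create({
//     model: 'ft:gpt-4o-mini-2024-07-18:chanl:support-agent:abc123',
//     temperature: 0.3,
//     messages,
//   });
const messages = buildGroundedMessages(
  "What's the return window for electronics?",
  ['Electronics return window: 15 days from delivery.'],
);
```

The design point: the retrieved docs land in the system turn at query time, so nothing needs retraining when the docs change.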
When to add each layer
Don't start with both — you won't know which layer is contributing what, and debugging gets harder. Add complexity incrementally:
Month 1: RAG only. Upload your docs, connect to a base model, ship. Gets you 80% of the way and takes a day. Monitor quality with scorecards.
Month 2: Evaluate. Run your scorecard across a few hundred real conversations. If accuracy is high but tone is low, you need fine-tuning. If both are high, you're done — don't over-engineer it.
Month 3: Add fine-tuning if needed. Collect 50-200 ideal response examples. Fine-tune. Run a comparison. Switch the agent to the fine-tuned model while keeping the knowledge base attached.
Ongoing: Iterate. New documents go into the knowledge base (immediate, no retraining). Style changes go into fine-tuning (requires retraining, ~30 minutes). Use scenario testing to catch regressions. Knowledge base updates are instant and continuous; fine-tuning updates are periodic and deliberate. This separation of concerns is what makes hybrid maintainable at scale.
The prompt changes for hybrid
When running a fine-tuned model with RAG, the system prompt focuses on grounding rather than tone (the model already knows your tone):
You are a TechStore support agent.
When context documents are provided, use them as your primary source
of truth. Your training has taught you the right tone and format —
apply that to the information in the context.
If the context doesn't contain enough information to answer the
question, say so honestly rather than guessing. Never contradict
the context documents, even if your training suggests different
information — the context is always more current.

Compare this to the RAG-only prompt, which needs extensive tone and format instructions because the base model hasn't been trained on your examples. Simpler prompts mean fewer tokens, lower costs, and less chance of competing instructions confusing the model. A small efficiency that compounds across millions of conversations.
Cost analysis: when does hybrid pay off?
| Approach | Setup cost | Per-query cost | Break-even (vs RAG) |
|---|---|---|---|
| RAG only | ~$0 (embedding docs) | ~$0.003-0.005 (retrieval + generation) | Baseline |
| Fine-tuning only | $5-25 (OpenAI API) | ~$0.001-0.002 (no retrieval) | ~5,000 queries |
| Hybrid | $5-25 (fine-tuning) + ~$0 (RAG) | ~$0.003-0.004 | Never cheaper than RAG alone |
| LoRA on open-source | $10-50 (GPU time) | ~$0.0005 (self-hosted) | ~2,000 queries |
Hybrid costs slightly more per query than RAG-only. But it costs less in total when you factor in quality: fewer escalations to human agents, fewer follow-up messages, higher satisfaction leading to retention. The cost delta is noise; the quality delta shows up in your metrics.
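The break-even column is just setup cost divided by per-query savings. A sketch using midpoints of the ranges above — your exact figures will vary with token volumes, which is why the table hedges with "~":

```typescript
// Queries needed before a one-time setup cost is repaid by cheaper queries
function breakEvenQueries(
  setupCost: number,
  baselinePerQuery: number,
  cheaperPerQuery: number,
): number {
  return Math.round(setupCost / (baselinePerQuery - cheaperPerQuery));
}

// OpenAI fine-tune vs RAG, midpoints: $15 setup, $0.004 vs $0.0015 per query
console.log(breakEvenQueries(15, 0.004, 0.0015)); // → 6000, same ballpark as the table's ~5,000
```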
Start with RAG. Fine-tune when RAG isn't enough.
Here's the framework, condensed:
Step 1: Build RAG. Upload documents, connect to a base model, ship. Always the right starting point — fast, cheap, reversible. If the answers are factually correct and well-received, stop here. Many teams never need to go further, and that's fine.
Step 2: Measure. Run an evaluation that separates accuracy from tone from format. If accuracy is weak, don't reach for fine-tuning — fix your chunking strategy, try a better embedding model, or add a reranker. If tone or format is weak despite a solid system prompt, that's when fine-tuning earns its place.
Step 3: Fine-tune only the behavior gap. Don't fine-tune to teach facts — RAG does that better because it's updateable without retraining. Fine-tune for style, structure, and domain reasoning. Fifty high-quality examples, LoRA, under $25. Target the specific patterns you want to change.
Step 4: Combine. Attach your knowledge base to the fine-tuned agent. Compare against your RAG-only baseline. If hybrid scores higher, ship it. If not, stick with whichever single approach won. The data will tell you — and it's often surprising.
Step 5: Monitor. Quality degrades silently. New products get added to docs but the fine-tuned model hasn't seen similar examples. Set up automated scorecard evaluation on production conversations. When scores drop, diagnose which dimension fell — accuracy points to a knowledge base gap, tone points to a fine-tuning refresh.
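The measurement in Steps 2 and 5 is, at its core, a weighted average over per-dimension scores — the same math the scorecard earlier in this guide encodes (accuracy 0.4; tone, format, and completeness 0.2 each). A minimal sketch with hypothetical 1-5 scores:

```typescript
// Criteria weights matching the scorecard defined earlier in this guide
const weights = { accuracy: 0.4, tone: 0.2, format: 0.2, completeness: 0.2 };
type Scores = Record<keyof typeof weights, number>;

// Weighted average of per-dimension scores (1-5 scale)
function weightedScore(scores: Scores): number {
  return (Object.keys(weights) as (keyof typeof weights)[]).reduce(
    (sum, criterion) => sum + weights[criterion] * scores[criterion],
    0,
  );
}

// Perfect accuracy with middling delivery still caps the overall score —
// which is exactly the signal that tells you to fine-tune, not re-chunk.
console.log(
  weightedScore({ accuracy: 5, tone: 4, format: 4, completeness: 4 }).toFixed(2),
); // → "4.40"
```

Keeping the dimensions separate is the whole point: a falling aggregate tells you something broke, but only the per-criterion breakdown tells you which layer to fix.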
The most common regret I see? Teams jumping straight to fine-tuning because it feels more sophisticated. They spend weeks curating training data and iterating on hyperparameters, only to discover their real problem was a chunking strategy that split policy documents at the wrong boundaries. The fix took an afternoon once they realized RAG was the right tool.
The teams that get this right aren't the ones with the most sophisticated ML pipelines. They're the ones who measure first, build the simplest thing that works, and add complexity only when the metrics demand it.
Don't start with the scalpel.
- OpenAI Fine-Tuning Documentation — Getting Started with Fine-Tuning — OpenAI (2026)
- LoRA: Low-Rank Adaptation of Large Language Models (arXiv 2106.09685) — Hu et al. (2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (arXiv 2305.14314) — Dettmers et al. (2023)
- Hugging Face PEFT Library — Parameter-Efficient Fine-Tuning — Hugging Face (2026)
- RAG vs Fine-Tuning: How to Choose the Right Method for Your LLM Application — SuperAnnotate (2025)
- Fine-Tuning vs RAG: A Detailed Comparison of LLM Optimization Techniques — Turing (2025)
- OpenAI Embeddings API — text-embedding-3-small Documentation — OpenAI (2026)
- A Survey on Retrieval-Augmented Generation for LLMs (arXiv 2312.10997) — Gao et al. (2024)