
How to Evaluate AI Agents: Build an Eval Framework from Scratch

Build a working AI agent eval framework in TypeScript. Covers LLM-as-judge, rubric scoring, regression testing, and CI integration.

Dean Grover, Co-founder
March 6, 2026
20 min read

A team I know shipped a customer support agent after three days of manual testing — maybe forty questions, liked the answers, called it ready. Within a week, it was confidently quoting a deprecated refund policy. It told three customers they were eligible for full refunds on final-sale items. The agent didn't hallucinate or crash. It just gave plausible-sounding wrong answers, and nobody caught it for five days.

This guide walks you through building a real eval framework — one that runs before every deploy, catches regressions automatically, and actually earns your trust. We'll build a working harness in TypeScript, cover all six eval types, and set you up with patterns that scale from side project to production.

| What you'll learn | Why it matters |
|---|---|
| LLM-as-judge scoring | Automate quality assessment with structured rubrics instead of manual review |
| Multi-criteria rubrics | Score accuracy, tone, and completeness independently to pinpoint failures |
| Regression baselines | Catch prompt and model changes that silently degrade quality |
| CI integration | Block deploys automatically when eval scores drop below threshold |
| A/B prompt comparison | Compare prompt versions with numbers, not feelings |
| Multi-turn conversation evals | Test full dialogue flows, not just single Q&A pairs |

Why AI agents need evals instead of traditional tests

Traditional tests are binary — a function returns the right value or it doesn't. AI agents aren't like that. Ask the same agent the same question twice and you'll get two different answers, both potentially correct, both phrased differently. "Did the agent do a good job?" isn't a boolean. It's a spectrum across accuracy, tone, completeness, and policy adherence.

Without evals, you'll hit predictable problems: prompt tweaks that fix billing questions but silently degrade shipping answers by 15%, model upgrades that break formatting your downstream systems depend on, and no way to compare approaches with anything better than gut feel. Evals give you numbers where you used to have vibes.

"Evaluation is the immune system of an AI application. Without it, every change is a potential infection you won't detect until the symptoms are obvious."
Industry observation: common wisdom among production AI teams

What are the six types of AI agent evals?

Not all evaluations work the same way. Each type targets a different aspect of agent quality, and production systems typically combine several.

1. Exact match and heuristic evals

The simplest kind. Does the output contain a specific string? Is it valid JSON? Is it under a certain length?

typescript
function evalFormatting(output: string): boolean {
  // Must not contain internal system tags
  if (output.includes("[INTERNAL]") || output.includes("{{")) return false;
 
  // Must stay under 500 words
  if (output.split(/\s+/).length > 500) return false;
 
  // Must not quote a dollar amount without a disclaimer
  const hasDollar = /\$\d+/.test(output);
  const hasDisclaimer = /subject to change|may vary|contact.*for.*pricing/i.test(output);
  if (hasDollar && !hasDisclaimer) return false;
 
  return true;
}

Heuristic evals are fast, deterministic, and cheap. Use them as a first pass to catch obvious structural failures before spending money on LLM-as-judge scoring.
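Another common heuristic is structural validation: if your agent must emit JSON for a downstream system, a few lines can reject malformed output before any judge model sees it. This is a minimal sketch; `evalJsonStructure` and its `requiredKeys` parameter are hypothetical names, not part of the harness above.

```typescript
// Heuristic check: is the output parseable JSON with the keys
// downstream systems expect?
function evalJsonStructure(output: string, requiredKeys: string[]): boolean {
  try {
    const parsed = JSON.parse(output);
    if (typeof parsed !== "object" || parsed === null) return false;
    return requiredKeys.every((key) => key in parsed);
  } catch {
    return false; // not valid JSON at all
  }
}
```

Like the formatting check above, this costs nothing per run, so it makes sense as a gate before any paid judge call.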

2. LLM-as-judge

The workhorse of modern eval frameworks. You use one LLM to evaluate another's output. The judge gets the original question, the agent's response, and a scoring rubric, then returns structured scores with reasoning.

The judge prompt matters enormously. A vague prompt produces inconsistent scores. A precise one with rubrics and examples produces scores that correlate strongly with human judgment.

text
Input: "What's your return policy?"
 
Agent output: "You can return any item within 30 days for a
full refund, no questions asked!"
 
Judge prompt:
  - Score ACCURACY (1-5): Is the information factually correct
    given the reference policy?
  - Score COMPLETENESS (1-5): Did the agent cover all relevant
    details (timeframe, conditions, exceptions)?
  - Score TONE (1-5): Was the response appropriately helpful
    without being misleading?
 
Judge output:
  accuracy: 3 (correct timeframe but omitted the
    "original packaging required" condition)
  completeness: 2 (missing restocking fee, packaging
    requirement, and gift card exception)
  tone: 4 (friendly and clear, slightly overpromises
    with "no questions asked")

3. Reference-based evals

You provide a "gold standard" answer and measure how close the agent's response is — not exact string matching, but semantic similarity or LLM-judged meaning comparison. Great for factual questions with clearly correct answers. Less useful for open-ended conversations where many different responses could be equally good.
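To make "how close" concrete: production systems typically compare embedding vectors, but even a bag-of-words cosine similarity illustrates the idea. This sketch (the function name and tokenization regex are my own, not from any library) scores 1.0 for identical wording and 0 for no shared tokens:

```typescript
// Crude proxy for semantic similarity: cosine similarity over
// bag-of-words token counts. Real systems would use embeddings.
function tokenCosineSimilarity(a: string, b: string): number {
  const counts = (s: string): Map<string, number> => {
    const m = new Map<string, number>();
    for (const t of s.toLowerCase().match(/[a-z0-9']+/g) ?? []) {
      m.set(t, (m.get(t) ?? 0) + 1);
    }
    return m;
  };
  const ca = counts(a);
  const cb = counts(b);
  let dot = 0, normA = 0, normB = 0;
  for (const [token, n] of ca) {
    dot += n * (cb.get(token) ?? 0);
    normA += n * n;
  }
  for (const [, n] of cb) normB += n * n;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}
```

The limitation is obvious: "refunds allowed" and "refunds not allowed" share most tokens, which is exactly why reference-based evals in practice lean on embeddings or an LLM judge for the meaning comparison.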

4. Rubric-based evals

Instead of comparing against a reference answer, you define a rubric — a structured set of criteria with score levels. This is what you'll use most in practice. A rubric for a customer support agent might evaluate accuracy, empathy, policy adherence, and resolution effectiveness as separate dimensions.

The power here is decomposition. An agent can score 5/5 on accuracy while scoring 2/5 on empathy. That tells you exactly what to fix — something a single overall score never will. This is the same principle behind structured scorecard systems.

5. Human preference evals

Show a human two agent responses and ask which is better. Aggregate enough preferences and you get reliable rankings using Elo ratings or Bradley-Terry models.

Human preference evals are expensive and slow, but they're the gold standard for subjective quality. Use them to calibrate your automated evals: if your LLM judge consistently disagrees with human preferences, your judge prompt needs work.
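For intuition, a single pairwise preference feeds the standard Elo update: the preferred response's rating rises by an amount proportional to how surprising the win was. This sketch uses a hypothetical K-factor of 32; the function name is mine.

```typescript
// Standard Elo update for one pairwise preference. The winner gains
// more rating when it was the lower-rated (more surprising) choice.
function eloUpdate(
  winnerRating: number,
  loserRating: number,
  k: number = 32
): { winner: number; loser: number } {
  const expectedWin =
    1 / (1 + Math.pow(10, (loserRating - winnerRating) / 400));
  const delta = k * (1 - expectedWin);
  return { winner: winnerRating + delta, loser: loserRating - delta };
}
```

Run enough comparisons through updates like this and the ratings converge to a stable ranking of prompt or model variants.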

6. End-to-end task completion

Did the agent actually accomplish the goal? Was the issue resolved? Was the reservation made correctly? Were the right fields populated?

Task completion evals often require integration with your actual systems — checking that a ticket was created or a correct API call was made. They're the most realistic eval type, but also the most involved to set up. For agents handling multi-step workflows, scenario-based testing lets you simulate entire conversations with personas and validate the end state.

How do you build an eval harness in TypeScript?

Here's a complete, runnable implementation. Four components: test case definitions, an agent runner, an LLM-as-judge scorer, and a report generator.

typescript
import Anthropic from "@anthropic-ai/sdk";
 
// --- Types ---
 
interface TestCase {
  id: string;
  input: string;
  context?: string; // optional reference info for the judge
  criteria: string[]; // what the judge should evaluate
  expectedBehavior: string; // natural language description
}
 
interface EvalScore {
  criterion: string;
  score: number; // 1-5
  reasoning: string;
}
 
interface EvalResult {
  testCase: TestCase;
  agentOutput: string;
  scores: EvalScore[];
  averageScore: number;
  pass: boolean;
  latencyMs: number;
}
 
// --- Agent Under Test ---
 
async function runAgent(
  client: Anthropic,
  systemPrompt: string,
  userMessage: string
): Promise<{ output: string; latencyMs: number }> {
  const start = Date.now();
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userMessage }],
  });
  const output =
    response.content[0].type === "text" ? response.content[0].text : "";
  return { output, latencyMs: Date.now() - start };
}
 
// --- LLM-as-Judge ---
 
const JUDGE_PROMPT = `You are an expert evaluator for AI agent responses.
You will be given:
1. The user's input message
2. The agent's response
3. Context about what the correct behavior should be
4. A list of criteria to evaluate
 
For each criterion, provide:
- A score from 1-5 (1=terrible, 2=poor, 3=adequate, 4=good, 5=excellent)
- A brief reasoning explaining the score
 
Think step-by-step before scoring. Consider edge cases and subtle issues.
 
Respond in this exact JSON format:
{
  "scores": [
    {
      "criterion": "<criterion name>",
      "score": <1-5>,
      "reasoning": "<1-2 sentence explanation>"
    }
  ]
}`;
 
async function judgeResponse(
  client: Anthropic,
  testCase: TestCase,
  agentOutput: string
): Promise<EvalScore[]> {
  const message = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: JUDGE_PROMPT,
    messages: [
      {
        role: "user",
        content: `## User Input
${testCase.input}
 
## Agent Response
${agentOutput}
 
## Expected Behavior
${testCase.expectedBehavior}
 
${testCase.context ? "## Reference Context\n" + testCase.context : ""}
 
## Criteria to Evaluate
${testCase.criteria.map((c, i) => `${i + 1}. ${c}`).join("\n")}`,
      },
    ],
  });
 
  const text =
    message.content[0].type === "text" ? message.content[0].text : "{}";
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) throw new Error("Judge did not return valid JSON");
  const parsed = JSON.parse(jsonMatch[0]);
  return parsed.scores;
}
 
// --- Test Runner ---
 
async function runEvals(
  testCases: TestCase[],
  systemPrompt: string,
  passThreshold: number = 3.5
): Promise<EvalResult[]> {
  const client = new Anthropic();
  const results: EvalResult[] = [];
 
  for (const testCase of testCases) {
    console.log(`Running: ${testCase.id}...`);
    const { output, latencyMs } = await runAgent(
      client, systemPrompt, testCase.input
    );
    const scores = await judgeResponse(client, testCase, output);
    const avg =
      scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
 
    results.push({
      testCase,
      agentOutput: output,
      scores,
      averageScore: Math.round(avg * 100) / 100,
      pass: avg >= passThreshold,
      latencyMs,
    });
  }
 
  return results;
}
 
// --- Report ---
 
function printReport(results: EvalResult[]): void {
  console.log("\n" + "=".repeat(60));
  console.log("EVALUATION REPORT");
  console.log("=".repeat(60));
 
  const passed = results.filter((r) => r.pass).length;
  console.log(`\nOverall: ${passed}/${results.length} passed\n`);
 
  for (const r of results) {
    const icon = r.pass ? "PASS" : "FAIL";
    console.log(`[${icon}] ${r.testCase.id} — avg: ${r.averageScore} (${r.latencyMs}ms)`);
    for (const s of r.scores) {
      console.log(`       ${s.criterion}: ${s.score}/5 — ${s.reasoning}`);
    }
    console.log();
  }
}
 
// --- Test Cases ---
 
const SUPPORT_AGENT_PROMPT = `You are a customer support agent for TechCo.
Our return policy: 30-day returns with original packaging. Restocking
fee of 15% for opened electronics. Gift cards are final sale.
Business hours: Mon-Fri 9am-6pm EST.
Always be helpful, accurate, and empathetic.`;
 
const testCases: TestCase[] = [
  {
    id: "eval-001",
    input: "I bought a laptop 3 weeks ago and want to return it. I opened the box though.",
    context: "30-day return window. Opened electronics have 15% restocking fee.",
    criteria: ["Accuracy", "Completeness", "Empathy"],
    expectedBehavior:
      "Should confirm the return is within the 30-day window, mention the " +
      "15% restocking fee for opened electronics, and be empathetic.",
  },
  {
    id: "eval-002",
    input: "Can I return a gift card?",
    context: "Gift cards are final sale and cannot be returned.",
    criteria: ["Accuracy", "Tone", "Policy adherence"],
    expectedBehavior:
      "Should clearly state that gift cards are final sale and cannot be " +
      "returned. Should be empathetic but firm. Must not offer alternatives " +
      "that contradict the policy.",
  },
  {
    id: "eval-003",
    input: "Your product broke after 2 days! This is unacceptable!",
    context: "Defective items within 30 days get full refund, no restocking fee.",
    criteria: ["Empathy", "Accuracy", "De-escalation", "Resolution"],
    expectedBehavior:
      "Should acknowledge frustration, apologize, explain that defective items " +
      "qualify for full refund without restocking fee, and offer clear next steps.",
  },
  {
    id: "eval-004",
    input: "What are your hours? Also can I return something I bought 45 days ago?",
    context: "Hours: Mon-Fri 9-6 EST. Returns within 30 days only.",
    criteria: ["Accuracy", "Completeness", "Clarity"],
    expectedBehavior:
      "Should answer BOTH questions. State business hours correctly AND explain " +
      "that the 45-day return is outside the 30-day window. Must not skip either question.",
  },
];
 
// --- Run ---
 
runEvals(testCases, SUPPORT_AGENT_PROMPT).then(printReport);

Install the SDK (npm install @anthropic-ai/sdk), set your ANTHROPIC_API_KEY environment variable, and run it with npx tsx eval-harness.ts.

Here's what each piece does:

Test cases define input, expected behavior, reference context, and specific criteria. Each criterion gets its own score — you're not collapsing everything into a single number.

The agent runner calls your LLM and captures both output and latency. In production, you'd swap this for a call to your actual agent API.

The LLM judge gets the test case, the response, and a rubric. It uses chain-of-thought reasoning before scoring, which significantly improves consistency. It returns structured JSON with per-criterion scores.

The report shows pass/fail with a detailed breakdown so you can see which criteria failed and why.

A note on judge prompt design

The judge prompt is the most important piece of your eval framework. Three principles:

Be specific about what each score level means. "Score 1-5" is too vague. Add anchored examples: "A score of 3 means the response is factually correct but incomplete. A score of 5 means the response is correct, complete, and proactively addresses likely follow-up questions."

Ask for reasoning before the score. When the judge explains its thinking first, scores are more consistent. This is chain-of-thought prompting applied to evaluation.

Use a strong model for judging. Your judge should be at least as capable as the model you're evaluating. A weaker judge produces unreliable results.

Which eval metrics actually matter?

Quality metrics

Accuracy — Is the response factually correct? Non-negotiable for production agents. Measure it per-response with LLM-as-judge scoring against known facts or reference documents.

Faithfulness — Does the response stay grounded in provided context? An agent that's "accurate" but draws on training data instead of your knowledge base is a liability. Faithfulness measures whether claims are supported by retrieved context, not just whether they happen to be true.

Relevance — Did the agent address what the user asked? An accurate, faithful response that doesn't answer the question is still a failure.

Completeness — Did the response cover everything it should? Missing the restocking fee when explaining return policy isn't inaccurate — it's incomplete. Different failure mode, different score.

Operational metrics

Latency — Track both p50 and p95. For conversational agents, anything over 3 seconds at p95 feels broken.
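Computing those percentiles from collected latencies is a few lines with the nearest-rank method (the helper name is mine; interpolating methods give slightly different values near the tails):

```typescript
// Nearest-rank percentile over a sample of latencies in milliseconds.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

Track `percentile(latencies, 50)` and `percentile(latencies, 95)` per eval run; a stable p50 with a climbing p95 usually means a subset of inputs is triggering slow paths.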

Cost per evaluation — If your full eval suite costs $50, you won't run it often enough. Optimize for pennies per run so you can execute on every PR.

Token usage — Track input and output tokens separately. Verbose agents cost more and often provide worse experiences.

Aggregate metrics

Pass rate — Percentage of test cases passing your threshold. Track it over time. A declining pass rate is an early warning signal.

Mean score by criterion — Average accuracy across all cases, average empathy, and so on. Shows which dimensions are strong and which need work.

Score variance — High variance means inconsistent behavior. Your agent might ace 8 out of 10 empathy tests but completely fail the other 2. Low averages are a systematic problem; high variance is a robustness problem.

Example: per-criterion averages across a test suite might show that de-escalation and completeness need work, even though the overall average looks acceptable.
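All three aggregate metrics fall out of one pass over the results. A minimal sketch, using a pared-down result shape rather than the full `EvalResult` from the harness above:

```typescript
interface ScoredCase {
  pass: boolean;
  scores: { criterion: string; score: number }[];
}

// Pass rate for the suite, plus mean and variance per criterion.
function aggregate(results: ScoredCase[]) {
  const passRate = results.filter((r) => r.pass).length / results.length;

  // Group every score by its criterion name
  const byCriterion = new Map<string, number[]>();
  for (const r of results) {
    for (const s of r.scores) {
      const arr = byCriterion.get(s.criterion) ?? [];
      arr.push(s.score);
      byCriterion.set(s.criterion, arr);
    }
  }

  const stats: Record<string, { mean: number; variance: number }> = {};
  for (const [criterion, scores] of byCriterion) {
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    const variance =
      scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
    stats[criterion] = { mean, variance };
  }
  return { passRate, stats };
}
```

Plotting `stats` per criterion over time is what surfaces the "high mean, high variance" robustness problems described above.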

How should you design your eval set?

Your eval set is the collection of test cases you run against your agent. Quality of coverage matters far more than quantity.

Coverage over volume

Twenty well-designed test cases covering your key scenarios beat two hundred random ones. Structure around conversation categories:

| Category | Example test cases |
|---|---|
| Happy path | Standard questions with clear answers |
| Edge cases | Boundary conditions (day 30 of a 30-day return window) |
| Policy conflicts | User wants something the policy doesn't allow |
| Multi-part questions | Two or three questions in a single message |
| Emotional users | Frustrated, confused, or upset callers |
| Ambiguous inputs | Questions that could mean multiple things |
| Out-of-scope | Questions the agent shouldn't try to answer |
| Adversarial | Attempts to get the agent to break its rules |

The golden test set

Maintain a curated set of 20-50 test cases as your regression suite. These don't change unless the underlying policy does. Every prompt edit, model change, and configuration update gets run against this set before deployment.

When a production bug surfaces, add a test case for it. Your golden set should grow over time, accumulating hard-won knowledge from every failure.

Versioning and tracking

Version your eval set like code. When you change a test case, you should know why. When scores shift between runs, you need to tell whether the agent changed or the test did.

Store eval results with metadata: prompt version, model, eval set version, timestamp. This creates the audit trail you need for debugging regressions. Production monitoring complements this by catching issues your eval set didn't anticipate.
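One possible shape for that stored record, with hypothetical field and function names chosen to match the metadata listed above:

```typescript
// Everything needed to reconstruct what a past eval run tested.
interface EvalRunRecord {
  promptVersion: string;
  model: string;
  evalSetVersion: string;
  timestamp: string; // ISO 8601
  passRate: number;
}

function makeRunRecord(
  promptVersion: string,
  model: string,
  evalSetVersion: string,
  casePasses: boolean[]
): EvalRunRecord {
  return {
    promptVersion,
    model,
    evalSetVersion,
    timestamp: new Date().toISOString(),
    passRate: casePasses.filter(Boolean).length / casePasses.length,
  };
}
```

Append records like this to a JSON log in your repo and diffing two runs becomes a matter of comparing two objects, not reconstructing history from CI logs.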

How do you run evals in CI on every pull request?

Here's a GitHub Actions workflow that runs your eval suite and blocks merging if scores drop:

yaml
name: Agent Evals
 
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/agent/**"
      - "eval/**"
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
 
      - name: Install dependencies
        run: npm ci
 
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npx tsx eval/run.ts --output eval-results.json
 
      - name: Check thresholds
        run: |
          node -e "
            const r = require('./eval-results.json');
            const failed = r.results.filter(t => !t.pass);
            if (failed.length > 0) {
              console.log('FAILED EVALS:');
              failed.forEach(f => console.log('  ' + f.testCase.id + ': ' + f.averageScore));
              process.exit(1);
            }
            const avgScore = r.results.reduce((s,t) => s + t.averageScore, 0) / r.results.length;
            if (avgScore < 4.0) {
              console.log('Average score ' + avgScore + ' below threshold 4.0');
              process.exit(1);
            }
            console.log('All evals passed. Average: ' + avgScore.toFixed(2));
          "
 
      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            const passed = results.results.filter(r => r.pass).length;
            const total = results.results.length;
            const avg = (results.results.reduce((s,r) => s + r.averageScore, 0) / total).toFixed(2);
 
            let body = '## Agent Eval Results\n\n';
            body += '| Test | Score | Status |\n|------|-------|--------|\n';
            results.results.forEach(r => {
              const status = r.pass ? 'Pass' : 'Fail';
              body += '| ' + r.testCase.id + ' | ' + r.averageScore + ' | ' + status + ' |\n';
            });
            body += '\n**Average: ' + avg + '** | **' + passed + '/' + total + ' passed**';
 
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

The workflow triggers on prompt or agent code changes. It runs the full eval suite, checks thresholds, and posts a summary on the PR. Failed scores block merging.

Cost control. Each run calls your LLM twice per test case (agent + judge). With 30 cases, that's 60 calls — typically $0.50-$2.00 total.

Flakiness. LLM-as-judge scores have natural variance. A test scoring 3.8 on one run might hit 3.4 next time. Set your pass threshold with margin, or run each case three times and take the median.
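The median-of-three approach is a few lines; with three runs it discards a single outlier in either direction (helper name is mine):

```typescript
// Median of repeated judge scores for one test case.
function medianScore(scores: number[]): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Three judge calls triples the eval cost for that case, so many teams reserve repetition for tests that have historically flaked rather than the whole suite.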

Speed. Run test cases in parallel where possible. A 30-case suite running sequentially takes about 3 minutes. Batches of 10 bring it under a minute.
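One way to get those batches of 10 is a generic batch runner: each batch runs concurrently via `Promise.all`, while batches run sequentially to stay under rate limits. A sketch (the function name is mine):

```typescript
// Run async tasks in fixed-size concurrent batches, preserving order.
async function runInBatches<T>(
  tasks: (() => Promise<T>)[],
  batchSize: number
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i += batchSize) {
    // Start this batch's tasks together, wait for all before the next batch
    const batch = tasks.slice(i, i + batchSize).map((task) => task());
    results.push(...(await Promise.all(batch)));
  }
  return results;
}
```

In the harness above you'd wrap each test case as `() => runAgent(...)` and pass the array here instead of awaiting inside the `for` loop. Note that one rejected task fails its whole batch; add per-task error handling if you want partial results.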

How do you catch regressions with eval baselines?

Store a baseline of passing scores in a JSON file committed to your repo. After each run, compare current scores against the baseline and flag any criterion that drops more than your threshold.

typescript
interface Baseline {
  [testCaseId: string]: {
    [criterion: string]: number; // baseline score
  };
}
 
function checkRegressions(
  results: EvalResult[],
  baseline: Baseline,
  regressionThreshold: number = 1.0
): { testId: string; criterion: string; drop: number }[] {
  const regressions: { testId: string; criterion: string; drop: number }[] = [];
 
  for (const result of results) {
    const baselineScores = baseline[result.testCase.id];
    if (!baselineScores) continue;
 
    for (const score of result.scores) {
      const baseScore = baselineScores[score.criterion];
      if (baseScore === undefined) continue;
 
      const drop = baseScore - score.score;
      if (drop >= regressionThreshold) {
        regressions.push({
          testId: result.testCase.id,
          criterion: score.criterion,
          drop,
        });
      }
    }
  }
 
  return regressions;
}
 
// Usage: after running evals, check for regressions
const regressions = checkRegressions(results, previousBaseline);
if (regressions.length > 0) {
  console.error("REGRESSIONS DETECTED:");
  regressions.forEach((r) =>
    console.error(`  ${r.testId} / ${r.criterion}: dropped ${r.drop} points`)
  );
  process.exit(1);
}

Update your baseline after every successful eval run you're happy with. This creates a quality ratchet — scores can only go up, never quietly degrade.
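One conservative way to implement that ratchet is a max-merge: new scores are folded into the baseline, but a lower score never overwrites a higher one. This is a sketch of one interpretation (you might instead overwrite wholesale on any accepted run); the function name is mine.

```typescript
// Merge a run's scores into the baseline, keeping the higher value
// per criterion so the baseline only ever ratchets upward.
function updateBaseline(
  baseline: Record<string, Record<string, number>>,
  results: { id: string; scores: { criterion: string; score: number }[] }[]
): Record<string, Record<string, number>> {
  // Deep-copy so the stored baseline is never mutated in place
  const next: Record<string, Record<string, number>> = JSON.parse(
    JSON.stringify(baseline)
  );
  for (const r of results) {
    if (!next[r.id]) next[r.id] = {};
    for (const s of r.scores) {
      const prev = next[r.id][s.criterion];
      if (prev === undefined || s.score > prev) {
        next[r.id][s.criterion] = s.score;
      }
    }
  }
  return next;
}
```

Commit the returned object back to the baseline file, and `checkRegressions` above always compares against the best the agent has ever done.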

Advanced patterns

A/B eval comparison

When testing a new prompt version, run the same cases against both prompts and compare:

typescript
async function comparePrompts(
  testCases: TestCase[],
  promptA: string,
  promptB: string
): Promise<void> {
  const resultsA = await runEvals(testCases, promptA);
  const resultsB = await runEvals(testCases, promptB);
 
  console.log("\nA/B COMPARISON");
  console.log("=".repeat(50));
  console.log("Test ID           | Prompt A | Prompt B | Delta");
  console.log("-".repeat(50));
 
  let totalA = 0, totalB = 0;
  for (let i = 0; i < testCases.length; i++) {
    const a = resultsA[i].averageScore;
    const b = resultsB[i].averageScore;
    const delta = b - a;
    const arrow = delta > 0 ? "+" : "";
    totalA += a;
    totalB += b;
    console.log(
      `${testCases[i].id.padEnd(18)}| ${a.toFixed(2).padEnd(9)}| ${b.toFixed(2).padEnd(9)}| ${arrow}${delta.toFixed(2)}`
    );
  }
 
  const avgA = totalA / testCases.length;
  const avgB = totalB / testCases.length;
  console.log("-".repeat(50));
  console.log(
    `Average           | ${avgA.toFixed(2).padEnd(9)}| ${avgB.toFixed(2).padEnd(9)}| ${(avgB - avgA > 0 ? "+" : "")}${(avgB - avgA).toFixed(2)}`
  );
}

This is essential for prompt engineering workflows. Instead of guessing whether a change helped, you get a clear comparison table.
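Beyond the table, it helps to summarize a paired comparison as win/loss/tie counts and a mean delta. A small sketch (the helper name is mine; it assumes the two score arrays are paired in the same test-case order, as `comparePrompts` produces):

```typescript
// Summarize a paired A/B comparison of per-case average scores.
function summarizeAB(
  scoresA: number[],
  scoresB: number[]
): { wins: number; losses: number; ties: number; meanDelta: number } {
  let wins = 0, losses = 0, ties = 0, deltaSum = 0;
  for (let i = 0; i < scoresA.length; i++) {
    const delta = scoresB[i] - scoresA[i]; // positive means B improved
    deltaSum += delta;
    if (delta > 0) wins++;
    else if (delta < 0) losses++;
    else ties++;
  }
  return { wins, losses, ties, meanDelta: deltaSum / scoresA.length };
}
```

A positive mean delta driven by one huge win and several small losses reads very differently from a uniform improvement, which is why the counts matter alongside the average.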

Multi-turn conversation evals

Real agents handle entire conversations, not isolated questions. Evaluating multi-turn interactions requires tracking context across turns:

typescript
interface ConversationTestCase {
  id: string;
  turns: { role: "user" | "assistant"; content: string }[];
  // The last turn is the one we evaluate; earlier turns are context
  criteria: string[];
  expectedBehavior: string;
}
 
async function runConversationEval(
  client: Anthropic,
  systemPrompt: string,
  testCase: ConversationTestCase
): Promise<EvalResult> {
  // Build message history from all turns except the last user message
  const messages = testCase.turns.slice(0, -1).map((t) => ({
    role: t.role as "user" | "assistant",
    content: t.content,
  }));
 
  // Add the final user message
  const lastTurn = testCase.turns[testCase.turns.length - 1];
  messages.push({ role: "user", content: lastTurn.content });
 
  const start = Date.now();
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: systemPrompt,
    messages,
  });
 
  const output = response.content[0].type === "text"
    ? response.content[0].text : "";
 
  // Judge with full conversation context
  const scores = await judgeResponse(client, {
    id: testCase.id,
    input: testCase.turns.map(
      (t) => `${t.role}: ${t.content}`
    ).join("\n"),
    criteria: testCase.criteria,
    expectedBehavior: testCase.expectedBehavior,
  }, output);
 
  const avg = scores.reduce((s, sc) => s + sc.score, 0) / scores.length;
 
  return {
    testCase: {
      id: testCase.id,
      input: lastTurn.content,
      criteria: testCase.criteria,
      expectedBehavior: testCase.expectedBehavior,
    },
    agentOutput: output,
    scores,
    averageScore: Math.round(avg * 100) / 100,
    pass: avg >= 3.5,
    latencyMs: Date.now() - start,
  };
}

Multi-turn evals catch context loss — an agent that handles individual questions well might forget details from earlier in the conversation. Production analytics will tell you where these breakdowns happen most.

Cost-aware evaluation

Track eval costs so you can optimize. Here's a quick estimator:

typescript
function estimateCost(
  results: EvalResult[],
  pricePerKInput: number = 0.003,
  pricePerKOutput: number = 0.015
): { totalCost: number; costPerCase: number } {
  // ~200 input tokens per agent call, ~300 per judge call, ~200 output each
  const totalInputTokens = results.length * 500;
  const totalOutputTokens = results.length * 400;
 
  const inputCost = (totalInputTokens / 1000) * pricePerKInput;
  const outputCost = (totalOutputTokens / 1000) * pricePerKOutput;
  const totalCost = Math.round((inputCost + outputCost) * 10000) / 10000;
 
  return {
    totalCost,
    costPerCase: Math.round(totalCost / results.length * 10000) / 10000,
  };
}

What eval frameworks should you know about?

You don't have to build everything yourself. The ecosystem has matured.

Braintrust connects eval scoring with production tracing, dataset management, and CI-based release enforcement. Strong choice if you want a managed platform covering the full eval lifecycle.

DeepEval is open-source with plug-and-play metrics and pytest integration. Embeds directly in your test suite without a separate platform.

RAGAS focuses on RAG evaluation with research-backed retrieval and generation metrics. If your agent relies on retrieval-augmented generation, RAGAS metrics like faithfulness and answer relevancy are worth adding.

Langfuse offers open-source observability with built-in evaluation. Good for teams that want to self-host.

Promptfoo focuses on red-teaming and security validation alongside standard evals.

The harness you built earlier gives you the core patterns. These platforms add managed infrastructure, pre-built metrics, and dashboards on top of the same ideas.

Best practices checklist

  • Start with 20-30 well-designed test cases covering happy paths, edge cases, and adversarial inputs
  • Use LLM-as-judge with a detailed rubric — not a vague "rate this 1-5" prompt
  • Score multiple criteria independently (accuracy, completeness, tone, policy adherence)
  • Run evals in CI on every PR that touches prompts or agent code
  • Maintain a golden test set that grows with every production bug
  • Store baselines and check for regressions — quality should ratchet up, never quietly degrade
  • Track cost and latency alongside quality scores
  • Run A/B comparisons when testing prompt changes — never guess
  • Use a strong model as judge (at least as capable as the model being evaluated)
  • Add multi-turn conversation evals, not just single-turn Q&A
  • Version your eval set and track changes alongside code changes
  • Review judge scores against human judgment quarterly to check calibration

Where to go from here

You've got the building blocks: test case design, LLM-as-judge scoring, regression detection, CI integration, and working TypeScript code you can copy directly. That's enough to catch most issues before they reach production.

If you're just starting, get the harness running and write test cases for your ten most common customer interactions. Already running evals? Focus on CI integration and regression baselines. Got all of that? Explore multi-turn evaluation and A/B comparison for prompt optimization.

If you'd rather not build from scratch, Chanl's scorecard and scenario testing systems provide production-ready evaluation workflows — but the principles here apply regardless of tooling.

Start measuring. Stop guessing.

Dean Grover, Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
