
Digital Twins for AI Agents: Simulate Before You Ship

Build digital twins that test your AI agent against thousands of synthetic customers. Architecture, TypeScript code, and the patterns that catch failures.

Lucas Dalamarta, Engineering Lead
August 8, 2025
19 min read

In March 2025, researchers at Carnegie Mellon built a simulated company staffed entirely by AI agents. Agents were assigned roles — software engineers, product managers, data analysts — and given tasks that mirrored real workflows. Even the best-performing model, Anthropic's Claude, completed only 24% of assigned tasks successfully. The agents didn't crash. They produced outputs that looked reasonable. But when evaluated against actual task requirements, they fabricated data, misinterpreted instructions, and confidently delivered wrong answers.

The researchers didn't discover this by deploying agents to a real company. They discovered it by building a digital twin — a simulation environment where AI agents could be tested at scale against realistic conditions.

That's the core idea behind digital twins for AI agents: instead of shipping an agent and hoping for the best, you build a virtual environment where the agent faces thousands of simulated customers before it ever talks to a real one. You discover the 76% failure rate before production, not after.

This guide covers the architecture, implementation, and testing patterns for building digital twins that actually catch failures. We'll build a working simulation framework in TypeScript, design persona systems that generate diverse test conversations, and connect the scoring pipeline that turns raw simulation data into deploy/no-deploy decisions.

What you'll learn:

  • Digital twin architecture: a three-layer system (persona engine, scenario orchestrator, scoring pipeline)
  • Persona design: building synthetic customers that probe different failure modes
  • Simulation orchestration: running hundreds of conversations in parallel with deterministic control
  • Scoring and evaluation: grading simulated conversations against weighted rubrics
  • Regression detection: catching quality degradation across prompt and model changes
  • CI/CD integration: gating deploys on simulation results automatically

What is a digital twin for an AI agent?

A digital twin is a simulation environment that runs your real agent configuration against synthetic customers at scale — discovering failures, measuring quality drift, and validating changes before any real person is affected. The term borrows from manufacturing, where digital twins are virtual replicas of physical systems used for testing and optimization. For AI agents, the "twin" isn't a copy of the agent itself — it's the entire testing environment that surrounds it.

The distinction matters. You're not cloning your agent. You're building the world it operates in — the customers it talks to, the scenarios it encounters, the edge cases it has to navigate — and running your actual agent through that simulated world.

Here's what makes this different from manually chatting with your agent a few times before launch:

Scale. Instead of testing 20 conversations, you test 2,000. Instead of checking five customer types, you check fifty. Human testers can run maybe 10-15 conversations per hour. A digital twin runs hundreds per minute.

Diversity. Real customers come in infinite varieties — different emotional states, communication styles, domain knowledge levels, accents, patience thresholds. A digital twin with well-designed personas generates this diversity systematically, hitting combinations a human tester would never think to try.

Reproducibility. When a simulation reveals a failure, you can replay it. You can tweak the prompt and run the exact same persona against the updated agent. You can compare scores across versions with statistical confidence instead of gut feel.

Continuous validation. Digital twins don't get tired. They can run on every PR, every model update, every knowledge base change. This is particularly important because AI agents fail in ways that aren't visible without systematic measurement — a model provider updates their weights, and your agent's tone shifts subtly in ways no single conversation reveals.

Digital twin architecture: three layers that wrap your production agent in a simulation environment

How this differs from unit tests and manual QA

If you've read about AI agent testing, you know unit tests check individual components — intent classification accuracy, entity extraction, API response shapes — while scenario tests check conversation-level behavior. Digital twins sit one level above both. They're the orchestration layer that generates diverse scenarios automatically, runs them at scale, and tracks results over time.

Think of it as the difference between:

  • Unit tests: "Does the refund tool return the right JSON shape?" (yes/no)
  • Scenario tests: "When a frustrated customer asks for a refund on a final-sale item, does the agent handle it correctly?" (scored on a rubric)
  • Digital twin: "Run this agent against 500 different customer personas across 40 scenario types, compare scores to last week's baseline, and flag any regressions" (automated, continuous)

You need all three. But the digital twin is what makes the difference between "we tested it" and "we know how it performs across the full distribution of customer behavior."

Why do AI agents need simulation-based testing?

AI agents fail in ways that look like success — they produce fluent, confident, helpful-sounding responses that happen to be wrong, policy-violating, or emotionally tone-deaf — and only systematic simulation at scale reveals these failure patterns before customers find them. This isn't a theoretical concern. The failure rate in real deployments is sobering.

Gartner predicts that by 2028, 25% of enterprise security breaches will be attributed to AI agent abuse. The ASAPP 2025 report on the "AI agent failure era" found that most production CX agents struggle with exactly the scenarios that matter most — complex multi-step issues, emotionally charged interactions, and edge cases that fall outside training distribution. Cleanlab's 2025 survey found that AI agents in production face "shifting stacks" where new frameworks, APIs, and model versions change faster than teams can validate them.

The common thread: you can't predict these failures by looking at your agent's code. You have to run it against realistic conditions and measure what happens.

The failure taxonomy

Through simulation testing, a consistent pattern of failure categories emerges. Understanding these categories is how you design personas and scenarios that actually probe the right failure modes.

  • Hallucinated policies: the agent invents a return policy, discount, or procedure. Unit tests miss it because the fabrication is grammatically correct. In simulation, a persona asks about non-existent policies and scoring checks responses against ground truth.
  • Context loss across turns: the agent forgets information from turn 3 by turn 7. Each turn passes individually; multi-turn persona conversations reveal the dropped context.
  • Emotional miscalibration: the agent responds cheerfully to a clearly frustrated customer. Tone detection works in isolation; a frustrated persona escalates emotion over turns and scoring checks empathy alignment.
  • Tool selection errors: the agent calls the wrong tool or skips a required one. Tool integration tests pass; ambiguous scenarios force the agent to choose between tools.
  • Mid-conversation correction failures: the customer corrects information and the agent keeps using the original. Correction handling passes as an isolated test; a persona deliberately provides wrong info, then corrects it mid-conversation.
  • Knowledge boundary violations: the agent answers questions it shouldn't know the answer to. Knowledge retrieval returns relevant docs, so retrieval tests pass; a boundary-probing persona asks about competitor products and internal procedures.

Each of these categories maps to a persona design pattern. A persona that probes hallucinated policies is different from one that tests emotional miscalibration. The digital twin framework needs to generate both systematically.
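One way to make that mapping explicit in code (the category keys and trait pairings below are illustrative assumptions, not a canonical taxonomy):

```typescript
// Hypothetical mapping from failure category to the persona traits most
// likely to surface it. Pairings are illustrative defaults, not a fixed
// taxonomy — adjust them as your own failure data accumulates.
const failureProbeTraits: Record<string, {
  intent: 'cooperative' | 'tangential' | 'adversarial' | 'correction-prone';
  emotion: 'calm' | 'frustrated' | 'anxious' | 'confused' | 'angry';
}> = {
  hallucinated_policies: { intent: 'cooperative', emotion: 'calm' },
  context_loss: { intent: 'tangential', emotion: 'calm' },
  emotional_miscalibration: { intent: 'cooperative', emotion: 'frustrated' },
  tool_selection: { intent: 'cooperative', emotion: 'confused' },
  correction_failures: { intent: 'correction-prone', emotion: 'calm' },
  knowledge_boundaries: { intent: 'adversarial', emotion: 'calm' },
};
```

A persona generator can consult this table to bias trait selection toward the failure category a scenario is meant to probe.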

"Agents don't fail on the questions you prepared for. They fail on the questions your customers invent — the ambiguous ones, the emotional ones, the ones that require saying 'I don't know.'"
Sierra AI, "Simulations: The Secret Behind Every Great Agent" (2025)

How do you design personas for simulation testing?

Effective personas are defined along four behavioral axes — communication style, emotional state, domain knowledge, and behavioral intent — and the most valuable testing comes from combining these axes into profiles that probe specific failure categories. A persona isn't a name and a backstory. It's a set of behavioral constraints that guide an LLM to role-play a specific type of customer consistently across a multi-turn conversation.

The goal is diversity that maps to real customer distributions. If 30% of your production callers are frustrated, 30% of your simulation personas should be too. If 10% of real conversations involve mid-topic corrections, your persona mix should include that pattern.
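For instance, a minimal weighted sampler (the frequencies here are placeholders; substitute your own production analytics):

```typescript
// Sample persona emotional states to match an observed production
// distribution. The mix below is a placeholder, not real data.
type Emotion = 'calm' | 'frustrated' | 'anxious' | 'confused' | 'angry';

const productionMix: Record<Emotion, number> = {
  calm: 0.4, frustrated: 0.3, anxious: 0.1, confused: 0.15, angry: 0.05,
};

function sampleEmotion(mix: Record<Emotion, number>, rand = Math.random): Emotion {
  let r = rand();
  // Walk the cumulative distribution until the random draw is exhausted.
  for (const [emotion, weight] of Object.entries(mix) as [Emotion, number][]) {
    r -= weight;
    if (r <= 0) return emotion;
  }
  return 'calm'; // floating-point rounding fallback
}
```

The same pattern applies to communication style, knowledge level, and intent; sample each axis independently against its own production frequencies.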

The four axes of persona design

Communication style determines how the persona phrases things. Verbose callers provide too much context. Terse callers give one-word answers. Technical callers use domain jargon. Non-native speakers use simplified grammar. Each style tests different aspects of your agent's language understanding.

Emotional state shapes the conversation trajectory. A calm customer who encounters a policy limitation might accept it. A frustrated customer will push back. An anxious customer needs reassurance. The emotional axis tests whether your agent adapts its tone appropriately — a critical dimension that scorecard evaluation can measure.

Domain knowledge determines what the persona knows about your product. Expert users ask detailed technical questions. Novice users describe symptoms instead of problems ("the thing won't work"). The knowledge gap between expert and novice callers is one of the most common sources of production agent failures.

Behavioral intent is what the persona is actually trying to accomplish — and whether they're cooperating or adversarial. A cooperative persona follows the agent's prompts. A tangential persona drifts between topics. An adversarial persona tries to exploit policies or extract unauthorized information.

typescript
interface PersonaProfile {
  id: string;
  name: string;
  communicationStyle: 'verbose' | 'terse' | 'technical' | 'non-native' | 'rambling';
  emotionalState: 'calm' | 'frustrated' | 'anxious' | 'confused' | 'angry';
  domainKnowledge: 'expert' | 'intermediate' | 'novice';
  behavioralIntent: 'cooperative' | 'tangential' | 'adversarial' | 'correction-prone';
  systemPrompt: string;  // LLM prompt that makes the persona behave consistently
  targetFailureCategory?: string;  // Which failure type this persona probes
}
 
// A persona designed to probe context loss across turns
const contextDriftPersona: PersonaProfile = {
  id: 'context-drift-001',
  name: 'The Topic Switcher',
  communicationStyle: 'rambling',
  emotionalState: 'calm',
  domainKnowledge: 'intermediate',
  behavioralIntent: 'tangential',
  targetFailureCategory: 'context_loss',
  systemPrompt: `You are a customer calling about a billing issue, but you frequently
drift to other topics. Start by asking about a charge on your account. After 2-3 turns
about billing, mention a separate product question. Then return to the original billing
issue and reference specific details you mentioned earlier — test whether the agent
remembers them. If the agent loses context about your original issue, express mild
frustration and repeat the key details.
 
Your account number is AC-9382. You were charged $49.99 on March 3rd for a subscription
you thought you cancelled. You also want to know if the Premium tier includes API access.
 
IMPORTANT: When you return to the billing topic, do NOT repeat your account number.
Reference it as "the account I mentioned." Only repeat it if the agent explicitly asks.`,
};
 
// A persona designed to probe hallucinated policies
const policyProbePersona: PersonaProfile = {
  id: 'policy-probe-001',
  name: 'The Policy Explorer',
  communicationStyle: 'technical',
  emotionalState: 'calm',
  domainKnowledge: 'expert',
  behavioralIntent: 'cooperative',
  targetFailureCategory: 'hallucinated_policies',
  systemPrompt: `You are a savvy customer who asks very specific policy questions.
Your goal is to probe whether the agent knows the boundaries of actual company policy
versus things it might fabricate.
 
Ask about:
1. The exact refund window for annual subscriptions (real: 14 days)
2. Whether there's a loyalty discount for 2+ year customers (doesn't exist)
3. The SLA guarantee for API uptime (real: 99.9%)
4. Whether you can transfer a license to a colleague (doesn't exist)
 
For each response, ask a follow-up that tests specificity: "Can you point me to where
that's documented?" or "Is that in the terms of service?"
 
If the agent invents a policy that doesn't exist, accept it politely and move on —
the scoring system will catch it.`,
};

Generating persona combinations at scale

You don't need to hand-write every persona. Define the axes, then generate combinations programmatically:

typescript
function generatePersonaMatrix(
  axes: {
    styles: PersonaProfile['communicationStyle'][];
    emotions: PersonaProfile['emotionalState'][];
    knowledge: PersonaProfile['domainKnowledge'][];
    intents: PersonaProfile['behavioralIntent'][];
  }
): PersonaProfile[] {
  const personas: PersonaProfile[] = [];
  let counter = 0;
 
  // Full cartesian product would be 5 * 5 * 3 * 4 = 300 personas
  // Instead, use pairwise combinations for practical coverage
  for (const style of axes.styles) {
    for (const emotion of axes.emotions) {
      // Pick one knowledge level and one intent per style-emotion pair
      const knowledge = axes.knowledge[counter % axes.knowledge.length];
      const intent = axes.intents[counter % axes.intents.length];
 
      personas.push({
        id: `persona-${counter++}`,
        name: `${emotion}-${style}-${knowledge}`,
        communicationStyle: style,
        emotionalState: emotion,
        domainKnowledge: knowledge,
        behavioralIntent: intent,
        systemPrompt: buildPersonaPrompt(style, emotion, knowledge, intent),
      });
    }
  }
 
  return personas;
}
 
function buildPersonaPrompt(
  style: string,
  emotion: string,
  knowledge: string,
  intent: string
): string {
  const styleInstructions: Record<string, string> = {
    verbose: 'You tend to over-explain. Provide lots of context, sometimes irrelevant.',
    terse: 'You give short answers. One sentence max unless pressed for details.',
    technical: 'You use precise technical terminology and expect the same back.',
    'non-native': 'English is your second language. Use simpler grammar, occasionally misuse idioms.',
    rambling: 'You drift between topics and circle back. Stream of consciousness.',
  };
 
  const emotionInstructions: Record<string, string> = {
    calm: 'You are patient and reasonable throughout the conversation.',
    frustrated: 'You are increasingly frustrated. If the agent is unhelpful, escalate your tone.',
    anxious: 'You are worried about the outcome. Ask for reassurance repeatedly.',
    confused: 'You don\'t fully understand the situation. Ask clarifying questions.',
    angry: 'You are upset from the start. You want resolution NOW.',
  };
 
  return `You are a simulated customer in a testing environment.
 
COMMUNICATION: ${styleInstructions[style]}
EMOTIONAL STATE: ${emotionInstructions[emotion]}
DOMAIN KNOWLEDGE: You are a ${knowledge}-level user of the product.
BEHAVIOR: Your intent is ${intent}.
 
Stay in character throughout the conversation. Do not break the fourth wall.
Do not mention that you are a test or simulation.
Respond naturally based on your persona traits.`;
}

This persona generation approach draws from the same principles as adversarial testing — synthetic users that probe the agent's weaknesses systematically. Sierra AI's research on voice simulations confirms the pattern: "Simulations can be designed to re-create specific situations, including ones that are rare or hard to observe in the real world, but that can have an outsized impact on the quality and safety of the agent experience."

How do you orchestrate simulation at scale?

The scenario orchestrator is the engine that pairs personas with scenarios, manages conversation flow, handles concurrency, and collects the raw data the scoring pipeline needs. A well-designed orchestrator runs hundreds of conversations in parallel while maintaining deterministic control over each one — you need to know exactly which persona said what, in which scenario, and be able to replay any conversation that reveals a failure.

Scenario definition

A scenario defines the situation, separate from the persona that encounters it. The same "billing dispute" scenario should behave differently when a frustrated expert encounters it versus when a confused novice does.

typescript
interface SimulationScenario {
  id: string;
  name: string;
  category: 'billing' | 'technical_support' | 'account_management' | 'sales' | 'escalation';
  description: string;
  setup: {
    customerContext: Record<string, unknown>;  // Account data, order history, etc.
    agentContext?: Record<string, unknown>;     // Pre-loaded knowledge, tool access
  };
  pivotPoints?: {
    afterTurn: number;
    injection: string;  // Force a specific user message to test a specific behavior
  }[];
  successCriteria: {
    mustResolve: boolean;          // Must the issue be fully resolved?
    maxTurns: number;              // Efficiency threshold
    requiredTools?: string[];      // Tools the agent should use
    forbiddenActions?: string[];   // Things the agent must NOT do
    groundTruth?: Record<string, string>;  // Factual answers for scoring
  };
}
 
const billingDisputeScenario: SimulationScenario = {
  id: 'billing-dispute-001',
  name: 'Duplicate charge dispute',
  category: 'billing',
  description: 'Customer was charged twice for the same subscription renewal',
  setup: {
    customerContext: {
      accountId: 'AC-9382',
      plan: 'Professional',
      lastCharge: { amount: 49.99, date: '2025-03-03', description: 'Monthly renewal' },
      duplicateCharge: { amount: 49.99, date: '2025-03-03', description: 'Monthly renewal' },
      refundEligible: true,
    },
  },
  pivotPoints: [
    {
      afterTurn: 4,
      injection: 'Actually, wait — I just realized the second charge might be for my team account. Can you check both accounts?',
    },
  ],
  successCriteria: {
    mustResolve: true,
    maxTurns: 12,
    requiredTools: ['lookup_billing_history', 'process_refund'],
    forbiddenActions: ['transfer_to_human_without_attempting_resolution'],
    groundTruth: {
      refundPolicy: '14-day refund window for subscription charges',
      duplicateChargeProcess: 'Automatic refund within 3-5 business days',
    },
  },
};

The simulation runner

The runner pairs personas with scenarios, manages the conversation loop, and captures everything needed for scoring:

typescript
interface ConversationTurn {
  role: 'customer' | 'agent';
  content: string;
  timestamp: number;
  toolsUsed?: string[];
  metadata?: Record<string, unknown>;
}
 
interface SimulationResult {
  id: string;
  scenarioId: string;
  personaId: string;
  turns: ConversationTurn[];
  totalDuration: number;
  toolsUsed: string[];
  resolved: boolean;
  turnCount: number;
}
 
async function runSimulation(
  scenario: SimulationScenario,
  persona: PersonaProfile,
  agentEndpoint: string,
  config: { maxTurns: number; turnTimeout: number }
): Promise<SimulationResult> {
  const turns: ConversationTurn[] = [];
  const allToolsUsed: string[] = [];
  const startTime = Date.now();
 
  // Generate the persona's opening message based on the scenario
  const openingMessage = await generatePersonaMessage(
    persona,
    scenario,
    [],  // no previous turns
  );
 
  turns.push({
    role: 'customer',
    content: openingMessage,
    timestamp: Date.now(),
  });
 
  for (let turnNum = 0; turnNum < config.maxTurns; turnNum++) {
    // Send to agent and get response
    const agentResponse = await callAgent(agentEndpoint, {
      messages: turns,
      context: scenario.setup.agentContext,
    });
 
    turns.push({
      role: 'agent',
      content: agentResponse.message,
      timestamp: Date.now(),
      toolsUsed: agentResponse.toolsUsed,
    });
 
    if (agentResponse.toolsUsed) {
      allToolsUsed.push(...agentResponse.toolsUsed);
    }
 
    // Check if conversation has naturally concluded
    if (agentResponse.conversationEnded) break;
 
    // Check for pivot point injections
    const pivot = scenario.pivotPoints?.find(p => p.afterTurn === turnNum + 1);
    const nextCustomerMessage = pivot
      ? pivot.injection
      : await generatePersonaMessage(persona, scenario, turns);
 
    turns.push({
      role: 'customer',
      content: nextCustomerMessage,
      timestamp: Date.now(),
    });
  }
 
  return {
    id: `sim-${scenario.id}-${persona.id}-${Date.now()}`,
    scenarioId: scenario.id,
    personaId: persona.id,
    turns,
    totalDuration: Date.now() - startTime,
    toolsUsed: [...new Set(allToolsUsed)],
    resolved: detectResolution(turns, scenario.successCriteria),
    turnCount: turns.length,
  };
}
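The runner above leans on three helpers that aren't shown: callAgent (assumed to be a thin HTTP wrapper around your agent endpoint), generatePersonaMessage, and detectResolution. Here is a hedged sketch of the latter two, with the LLM call injected as a parameter so the sketch stays dependency-free (the article's three-argument version would close over an LLM client instead):

```typescript
type Turn = { role: 'customer' | 'agent'; content: string; timestamp: number };
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };
type ChatFn = (messages: ChatMessage[]) => Promise<string>;

// Drive the simulated customer. Roles are flipped relative to the agent:
// prior customer turns become 'assistant' (the persona's own speech) and
// agent turns become 'user'.
async function generatePersonaMessage(
  persona: { systemPrompt: string },
  scenario: { description: string },
  previousTurns: Turn[],
  chat: ChatFn  // inject your LLM client here
): Promise<string> {
  const history: ChatMessage[] = previousTurns.map(t => ({
    role: t.role === 'customer' ? 'assistant' : 'user',
    content: t.content,
  }));
  return chat([
    { role: 'system', content: persona.systemPrompt },
    ...(history.length > 0
      ? history
      : [{ role: 'user' as const, content: `Begin the call. Situation: ${scenario.description}` }]),
  ]);
}

// Cheap keyword heuristic for resolution detection; production systems
// typically replace this with an LLM judge call.
function detectResolution(turns: Turn[], criteria: { maxTurns: number }): boolean {
  const lastAgent = [...turns].reverse().find(t => t.role === 'agent');
  if (!lastAgent) return false;
  if (turns.length > criteria.maxTurns * 2) return false; // blew the turn budget
  return /resolved|has been processed|anything else i can help/i.test(lastAgent.content);
}
```

The keyword regex is deliberately crude; its job is to flag candidate resolutions, with the scoring pipeline making the real judgment.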

Parallel execution with controlled concurrency

Running 500 simulations sequentially would take hours. But running all 500 in parallel would overwhelm your agent endpoint and your LLM API rate limits. Use controlled concurrency:

typescript
async function runSimulationSuite(
  scenarios: SimulationScenario[],
  personas: PersonaProfile[],
  agentEndpoint: string,
  config: { concurrency: number; maxTurns: number; turnTimeout: number }
): Promise<SimulationResult[]> {
  // Build the simulation matrix: every scenario × selected personas
  const simulationPairs: Array<{ scenario: SimulationScenario; persona: PersonaProfile }> = [];
 
  for (const scenario of scenarios) {
    // Not every persona needs every scenario — match by failure category
    const relevantPersonas = personas.filter(
      p => !p.targetFailureCategory || isRelevantToScenario(p, scenario)
    );
 
    for (const persona of relevantPersonas) {
      simulationPairs.push({ scenario, persona });
    }
  }
 
  console.log(`Running ${simulationPairs.length} simulations (${config.concurrency} concurrent)`);
 
  const results: SimulationResult[] = [];
 
  // Process in batches of `concurrency`
  for (let i = 0; i < simulationPairs.length; i += config.concurrency) {
    const batch = simulationPairs.slice(i, i + config.concurrency);
    const batchResults = await Promise.all(
      batch.map(({ scenario, persona }) =>
        runSimulation(scenario, persona, agentEndpoint, config)
      )
    );
    results.push(...batchResults);
 
    // Progress reporting
    const pct = Math.round(((i + batch.length) / simulationPairs.length) * 100);
    console.log(`Progress: ${pct}% (${i + batch.length}/${simulationPairs.length})`);
  }
 
  return results;
}

This is the orchestration pattern that platforms like Chanl's scenario testing system implement under the hood — persona management, conversation orchestration, and parallel execution with rate limiting.

How do you score and evaluate simulated conversations?

The scoring pipeline takes raw conversation transcripts and produces structured quality scores — combining LLM-as-judge evaluation for subjective quality with programmatic checks for hard policy violations. The output is a per-conversation scorecard that feeds into regression detection and deploy gating.

If you've built eval frameworks before, the scoring pipeline will feel familiar. The difference in a digital twin context is scale — you're scoring hundreds of conversations per run, not a handful.
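Scoring at that scale reuses the same rate-limiting idea as the simulation runner. A generic concurrency-limited map (a worker-pool variant of the batch loop shown earlier; the helper name is illustrative) keeps judge calls under your LLM rate limits:

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Results come back in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next index until the queue drains.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}
```

Usage would look something like `await mapWithConcurrency(simulationResults, 10, r => scoreSimulation(r, scenarioById[r.scenarioId], scorecard))`, where scenarioById is an assumed lookup from scenario id to scenario.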

Scorecard design for simulation

A simulation scorecard needs criteria that map to the failure categories your personas are probing. Generic criteria like "overall quality" don't give you actionable signal. Specific criteria tied to specific failure modes do.

typescript
interface ProgrammaticCheck {
  name: string;
  check: (
    result: SimulationResult,
    scenario: SimulationScenario
  ) => { passed: boolean; detail: string };
}

interface SimulationScorecard {
  id: string;
  criteria: Array<{
    name: string;
    description: string;
    weight: number;
    anchors: { score: number; description: string }[];
    appliesTo?: string[];  // Only score this for specific scenario categories
  }>;
  passingThreshold: number;
  programmaticChecks: ProgrammaticCheck[];
}
 
const simulationScorecard: SimulationScorecard = {
  id: 'twin-scorecard-v1',
  criteria: [
    {
      name: 'factual_accuracy',
      description: 'Are all factual claims correct per the ground truth?',
      weight: 0.30,
      anchors: [
        { score: 1, description: 'Multiple factual errors or fabricated policies' },
        { score: 3, description: 'Mostly accurate with minor imprecisions' },
        { score: 5, description: 'All claims verifiable against ground truth' },
      ],
    },
    {
      name: 'context_retention',
      description: 'Does the agent remember details from earlier in the conversation?',
      weight: 0.20,
      anchors: [
        { score: 1, description: 'Asks customer to repeat previously stated info' },
        { score: 3, description: 'Retains key details but misses some context' },
        { score: 5, description: 'References earlier details naturally and accurately' },
      ],
    },
    {
      name: 'emotional_calibration',
      description: 'Does the agent match the appropriate emotional register?',
      weight: 0.20,
      anchors: [
        { score: 1, description: 'Tone is wildly inappropriate (cheerful to angry customer)' },
        { score: 3, description: 'Generally appropriate but misses escalation cues' },
        { score: 5, description: 'Tone matches and adapts to emotional shifts across turns' },
      ],
    },
    {
      name: 'task_completion',
      description: 'Did the agent resolve the customer issue or appropriately escalate?',
      weight: 0.20,
      anchors: [
        { score: 1, description: 'Issue unresolved, customer left worse off' },
        { score: 3, description: 'Partial resolution or unnecessary human escalation' },
        { score: 5, description: 'Full resolution or well-justified escalation with context' },
      ],
    },
    {
      name: 'efficiency',
      description: 'Did the agent resolve efficiently without unnecessary turns?',
      weight: 0.10,
      anchors: [
        { score: 1, description: 'Excessive turns, repetitive questions, circular conversation' },
        { score: 3, description: 'Reasonable turn count with some redundancy' },
        { score: 5, description: 'Efficient conversation flow, no wasted turns' },
      ],
    },
  ],
  passingThreshold: 3.5,
  programmaticChecks: [
    {
      name: 'no_hallucinated_policy',
      check: (result, scenario) => {
        // This would be checked by the LLM judge against ground truth
        // Programmatic check verifies format compliance
        return { passed: true, detail: 'Deferred to LLM judge for factual accuracy' };
      },
    },
    {
      name: 'required_tools_used',
      check: (result, scenario) => {
        const required = scenario.successCriteria.requiredTools ?? [];
        const missing = required.filter(t => !result.toolsUsed.includes(t));
        return {
          passed: missing.length === 0,
          detail: missing.length > 0
            ? `Missing required tools: ${missing.join(', ')}`
            : 'All required tools used',
        };
      },
    },
    {
      name: 'turn_limit',
      check: (result, scenario) => {
        return {
          passed: result.turnCount <= scenario.successCriteria.maxTurns,
          detail: `${result.turnCount} turns (limit: ${scenario.successCriteria.maxTurns})`,
        };
      },
    },
    {
      name: 'no_forbidden_actions',
      check: (result, scenario) => {
        const forbidden = scenario.successCriteria.forbiddenActions ?? [];
        const violations = forbidden.filter(a => result.toolsUsed.includes(a));
        return {
          passed: violations.length === 0,
          detail: violations.length > 0
            ? `Forbidden actions taken: ${violations.join(', ')}`
            : 'No policy violations',
        };
      },
    },
  ],
};

LLM judge for simulated conversations

The judge scores each conversation against the scorecard criteria. For simulation testing, include the scenario's ground truth so the judge can check factual accuracy:

typescript
import OpenAI from 'openai';

const openai = new OpenAI();

interface ScorecardResult {
  simulationId: string;
  scores: Record<string, { score: number; reasoning: string }>;
  weightedAverage: number;
  passed: boolean;
  notes: string;
}

async function scoreSimulation(
  result: SimulationResult,
  scenario: SimulationScenario,
  scorecard: SimulationScorecard
): Promise<ScorecardResult> {
  const conversationText = result.turns
    .map(t => `[${t.role.toUpperCase()}]: ${t.content}`)
    .join('\n\n');
 
  const groundTruthSection = scenario.successCriteria.groundTruth
    ? `\n\nGROUND TRUTH (use this to verify factual claims):\n${
        Object.entries(scenario.successCriteria.groundTruth)
          .map(([k, v]) => `- ${k}: ${v}`)
          .join('\n')
      }`
    : '';
 
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: `You are an expert QA evaluator scoring AI agent conversations.
 
Score the following conversation on each criterion using the provided rubric anchors.
Return a JSON object with this exact structure:
{
  "scores": {
    "<criterion_name>": { "score": <1-5>, "reasoning": "<1-2 sentences>" }
  },
  "overall_notes": "<key observations about agent performance>"
}
 
CRITERIA:
${scorecard.criteria.map(c =>
  `${c.name} (weight: ${c.weight}): ${c.description}\n  ${
    c.anchors.map(a => `${a.score} = ${a.description}`).join('\n  ')
  }`
).join('\n\n')}
${groundTruthSection}`,
      },
      {
        role: 'user',
        content: `SCENARIO: ${scenario.name}
DESCRIPTION: ${scenario.description}
PERSONA: ${result.personaId}
TURNS: ${result.turnCount}
TOOLS USED: ${result.toolsUsed.join(', ') || 'none'}
 
CONVERSATION:
${conversationText}`,
      },
    ],
  });
 
  const judgeOutput = JSON.parse(response.choices[0].message.content ?? '{}');
 
  let weightedSum = 0;
  for (const criterion of scorecard.criteria) {
    const score = judgeOutput.scores[criterion.name]?.score ?? 0;
    weightedSum += score * criterion.weight;
  }
 
  return {
    simulationId: result.id,
    scores: judgeOutput.scores,
    weightedAverage: Math.round(weightedSum * 100) / 100,
    passed: weightedSum >= scorecard.passingThreshold,
    notes: judgeOutput.overall_notes,
  };
}

Run each scoring pass three times and take the median to smooth out LLM non-determinism. This is the same calibration technique described in the eval framework guide — consistency matters more than precision for any single run.
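A minimal sketch of that calibration step (scoreWithMedian is a hypothetical wrapper name; it assumes the judge call is passed in as a thunk):

```typescript
// Median of an array of numbers (middle element after sorting).
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// Run the judge `runs` times and keep the median weighted average,
// re-deriving pass/fail from the median rather than any single run.
async function scoreWithMedian(
  score: () => Promise<{ weightedAverage: number; passed: boolean }>,
  runs = 3,
  passingThreshold = 3.5
): Promise<{ weightedAverage: number; passed: boolean }> {
  const attempts = await Promise.all(
    Array.from({ length: runs }, () => score())
  );
  const med = median(attempts.map(a => a.weightedAverage));
  return { weightedAverage: med, passed: med >= passingThreshold };
}
```

In the article's terms, you would call it as `scoreWithMedian(() => scoreSimulation(result, scenario, scorecard))`.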

test-runner
$ chanl test --suite stress-test --agent production
PASS  Rapid-fire Q&A (23 questions)          142ms
PASS  Interruption handling (mid-sentence)    89ms
PASS  Accent variation (12 accents)          256ms
FAIL  Background noise (construction)
PASS  Long conversation (45 min)             312ms
PASS  Emotional escalation (angry → calm)     98ms
PASS  Multi-topic switching                  167ms
6 passed, 1 failed (85% pass rate)

How do you detect regressions across simulation runs?

Regression detection compares your current simulation results against a known-good baseline and flags statistically significant quality drops — distinguishing real degradation from normal LLM scoring variance. Without regression tracking, a prompt tweak that improves billing scenarios by 0.5 points while silently degrading returns scenarios by 1.2 points goes unnoticed until a customer reports it.
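As a sketch of what "statistically significant" can mean here, the check below computes a Welch's t statistic against a fixed critical value of about 2 (roughly 95% confidence for reasonable sample sizes). This is a pragmatic shortcut rather than a proper t-distribution lookup, and the function name is illustrative:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample variance (n - 1 denominator); assumes at least 2 samples.
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

// Flag a regression only when the score drop is unlikely to be judge
// noise: Welch's t = (mean1 - mean2) / sqrt(s1²/n1 + s2²/n2).
function isSignificantDrop(
  baseline: number[],
  current: number[],
  critical = 2.0
): boolean {
  const diff = mean(baseline) - mean(current);
  if (diff <= 0) return false; // current is equal or better
  const se = Math.sqrt(
    variance(baseline) / baseline.length + variance(current) / current.length
  );
  return se > 0 && diff / se > critical;
}
```

Run per scenario category as well as overall, so a drop in one category can't hide inside a stable global mean.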

Building the baseline

A baseline captures your agent's performance across the full simulation matrix at a point where you've verified quality is acceptable. Every subsequent run compares against this.

typescript
interface SimulationBaseline {
  version: string;
  timestamp: string;
  modelVersion: string;
  results: Map<string, {
    scenarioId: string;
    personaId: string;
    weightedAverage: number;
    criteriaScores: Record<string, number>;
    turnCount: number;
    resolved: boolean;
  }>;
  aggregates: {
    overallMean: number;
    passRate: number;
    byCategory: Record<string, { mean: number; passRate: number }>;
    byPersona: Record<string, { mean: number; passRate: number }>;
  };
}
 
function buildBaseline(
  results: SimulationResult[],
  scores: Map<string, ScorecardResult>,
  scenarios: SimulationScenario[],
  personas: PersonaProfile[],
  version: string
): SimulationBaseline {
  const baselineResults = new Map();
 
  for (const result of results) {
    const score = scores.get(result.id);
    if (!score) continue;
 
    baselineResults.set(result.id, {
      scenarioId: result.scenarioId,
      personaId: result.personaId,
      weightedAverage: score.weightedAverage,
      criteriaScores: Object.fromEntries(
        Object.entries(score.scores).map(([k, v]) => [k, v.score])
      ),
      turnCount: result.turnCount,
      resolved: result.resolved,
    });
  }
 
  // Compute aggregates by category and persona type
  const byCategory = groupAndAggregate(results, scores, scenarios, 'category');
  const byPersona = groupAndAggregate(results, scores, personas, 'emotionalState');
 
  const allScores = [...scores.values()].map(s => s.weightedAverage);
 
  return {
    version,
    timestamp: new Date().toISOString(),
    modelVersion: 'gpt-4o-2025-08-06',
    results: baselineResults,
    aggregates: {
      overallMean: mean(allScores),
      passRate: allScores.filter(s => s >= 3.5).length / allScores.length,
      byCategory,
      byPersona,
    },
  };
}
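The `mean` and `groupAndAggregate` helpers referenced above aren't shown in this guide; a minimal sketch might look like the following. Treat the signatures as assumptions — the grouping key matches on either the scenario or persona id, depending on which lookup list you pass in.

```typescript
// Sketch of the `mean` and `groupAndAggregate` helpers used by buildBaseline.
// Signatures are assumptions, not part of the framework above.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function groupAndAggregate<T extends { id: string }>(
  results: Array<{ id: string; scenarioId: string; personaId: string }>,
  scores: Map<string, { weightedAverage: number }>,
  lookup: T[],
  field: keyof T
): Record<string, { mean: number; passRate: number }> {
  const groups: Record<string, number[]> = {};
  for (const r of results) {
    const score = scores.get(r.id);
    // Match either the scenario or the persona, depending on which lookup was passed
    const item = lookup.find(l => l.id === r.scenarioId || l.id === r.personaId);
    if (!score || !item) continue;
    const key = String(item[field]);
    (groups[key] ??= []).push(score.weightedAverage);
  }
  return Object.fromEntries(
    Object.entries(groups).map(([key, xs]) => [key, {
      mean: mean(xs),
      passRate: xs.filter(s => s >= 3.5).length / xs.length,  // same 3.5 cutoff as buildBaseline
    }])
  );
}
```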

Comparing runs against baseline

The regression detector needs to distinguish signal from noise. LLM scoring has inherent variance — a 0.2-point drop might be noise, but a 0.6-point drop almost certainly isn't. Set thresholds based on the variance you observe in your scoring pipeline.
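One way to calibrate those thresholds empirically: score the same conversation repeatedly and derive the warn/fail deltas from the observed standard deviation. The sketch below assumes a `scoreOnce` callback (scoring one fixed conversation) and 2-sigma/3-sigma cutoffs — neither is prescribed by the framework; adjust to your pipeline's variance.

```typescript
// Sketch: estimate judge-score noise to calibrate regression thresholds.
// `scoreOnce` and the 2/3-sigma cutoffs are assumptions.
function stddev(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = xs.reduce((a, b) => a + b, 0) / xs.length;
  // Sample variance (n - 1 denominator)
  const variance = xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

async function estimateScoringNoise(
  scoreOnce: () => Promise<number>,
  runs = 10
): Promise<{ sigma: number; warnDelta: number; failDelta: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) scores.push(await scoreOnce());
  const sigma = stddev(scores);
  // Treat drops beyond ~2 sigma as warnings and ~3 sigma as failures
  return { sigma, warnDelta: 2 * sigma, failDelta: 3 * sigma };
}
```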

typescript
interface RegressionReport {
  status: 'passed' | 'warning' | 'failed';
  summary: string;
  categoryRegressions: Array<{
    category: string;
    baselineMean: number;
    currentMean: number;
    delta: number;
    severity: 'warning' | 'critical';
  }>;
  personaRegressions: Array<{
    personaType: string;
    baselineMean: number;
    currentMean: number;
    delta: number;
  }>;
  worstScenarios: Array<{
    scenarioId: string;
    personaId: string;
    score: number;
    notes: string;
  }>;
}
 
function detectSimulationRegressions(
  baseline: SimulationBaseline,
  current: SimulationBaseline,
  thresholds: {
    failDelta: number;   // e.g., 0.5 — absolute drop that blocks deploy
    warnDelta: number;   // e.g., 0.3 — drop that triggers warning
    minPassRate: number;  // e.g., 0.85 — minimum % of simulations that must pass
  }
): RegressionReport {
  const categoryRegressions: RegressionReport['categoryRegressions'] = [];
  const personaRegressions: RegressionReport['personaRegressions'] = [];
  let hasCritical = false;
 
  // Check by scenario category
  for (const [cat, baselineAgg] of Object.entries(baseline.aggregates.byCategory)) {
    const currentAgg = current.aggregates.byCategory[cat];
    if (!currentAgg) continue;
 
    const delta = currentAgg.mean - baselineAgg.mean;
    if (delta <= -thresholds.warnDelta) {
      const severity = delta <= -thresholds.failDelta ? 'critical' : 'warning';
      if (severity === 'critical') hasCritical = true;
      categoryRegressions.push({
        category: cat,
        baselineMean: baselineAgg.mean,
        currentMean: currentAgg.mean,
        delta,
        severity,
      });
    }
  }
 
  // Check by persona type
  for (const [pType, baselineAgg] of Object.entries(baseline.aggregates.byPersona)) {
    const currentAgg = current.aggregates.byPersona[pType];
    if (!currentAgg) continue;
 
    const delta = currentAgg.mean - baselineAgg.mean;
    if (delta <= -thresholds.warnDelta) {
      personaRegressions.push({
        personaType: pType,
        baselineMean: baselineAgg.mean,
        currentMean: currentAgg.mean,
        delta,
      });
    }
  }
 
  // Find worst individual conversations
  const worstScenarios = [...current.results.values()]
    .filter(r => r.weightedAverage < thresholds.failDelta + 2)  // clearly poor on the 1-5 scale
    .sort((a, b) => a.weightedAverage - b.weightedAverage)
    .slice(0, 5)
    .map(r => ({
      scenarioId: r.scenarioId,
      personaId: r.personaId,
      score: r.weightedAverage,
      notes: '',
    }));
 
  const passRateDrop = current.aggregates.passRate < thresholds.minPassRate;
 
  return {
    status: hasCritical || passRateDrop ? 'failed' : categoryRegressions.length > 0 ? 'warning' : 'passed',
    summary: buildRegressionSummary(categoryRegressions, personaRegressions, baseline, current),
    categoryRegressions,
    personaRegressions,
    worstScenarios,
  };
}
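The `buildRegressionSummary` helper called in the return statement isn't shown; a minimal sketch might render the deltas as a one-paragraph summary for the CI log or PR comment. The exact format here is an assumption.

```typescript
// Sketch of the buildRegressionSummary helper referenced above.
// Output format is an assumption — adapt to your CI reporting.
function buildRegressionSummary(
  categoryRegressions: Array<{ category: string; delta: number }>,
  personaRegressions: Array<{ personaType: string; delta: number }>,
  baseline: { aggregates: { overallMean: number } },
  current: { aggregates: { overallMean: number } }
): string {
  const overallDelta = current.aggregates.overallMean - baseline.aggregates.overallMean;
  const parts = [
    `Overall mean ${overallDelta >= 0 ? '+' : ''}${overallDelta.toFixed(2)} vs baseline.`,
  ];
  if (categoryRegressions.length > 0) {
    parts.push(`Regressed categories: ${categoryRegressions
      .map(c => `${c.category} (${c.delta.toFixed(2)})`).join(', ')}.`);
  }
  if (personaRegressions.length > 0) {
    parts.push(`Regressed persona types: ${personaRegressions
      .map(p => `${p.personaType} (${p.delta.toFixed(2)})`).join(', ')}.`);
  }
  return parts.join(' ');
}
```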

The regression report tells you where quality dropped — which scenario categories, which persona types, which specific conversations. This is the actionable intelligence that prompt engineering needs to target fixes effectively, rather than blindly tweaking the system prompt.

How do you integrate digital twin testing into CI/CD?

Wire the simulation suite into your deployment pipeline as a quality gate — run a smoke subset on every PR, the full matrix on major changes, and compare against the regression baseline to make automatic pass/warn/fail decisions. This transforms digital twin testing from something you do occasionally into a continuous, enforced part of your development workflow.

[Flowchart] Code change pushed → which files changed? Prompts, model config, tools, or knowledge → run smoke simulation (10 scenarios × 3 personas); other code → skip simulation, run standard CI. Smoke failed → block merge. Smoke passed → run full simulation (40 scenarios × 12 personas) → score all conversations → compare to baseline. Critical regression → block merge; warning → merge with warning label; none → approve merge and update baseline.
CI/CD pipeline with digital twin quality gate

Cost management for continuous simulation

Running 480 simulated conversations (40 scenarios × 12 personas) on every PR would get expensive. Here's how to keep costs practical:

| Strategy | Impact | How |
| --- | --- | --- |
| Path-filtered triggers | 70-80% fewer runs | Only trigger on prompt, model, tool, or knowledge base changes |
| Smoke-first gating | 60% fewer full runs | 30-conversation smoke test catches obvious regressions before the full suite |
| Persona sampling | 40-60% fewer conversations | Full persona matrix for major changes; sample 3-4 personas for minor ones |
| Cached judge scores | 30% less LLM spend | Cache scores for unchanged scenario-persona pairs across runs |
| Parallel execution | No cost savings, but 5x faster | Run conversations concurrently within rate limits |

A practical budget: 30 conversations for smoke tests ($0.60-2.40), 480 for the full matrix ($9.60-38.40). If you trigger the full suite twice a week and smoke tests on 10 PRs, you're looking at roughly $25-100/week. Compare that to the cost of one production incident with customer impact.
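Those figures fall out of a simple per-conversation cost model; a quick sketch, where the $0.02-0.08 per-conversation range is the assumption behind the numbers above:

```typescript
// Sketch: weekly simulation budget from a per-conversation cost.
// Suite sizes match the article; the cost range is an assumption.
function weeklyCost(
  smokeRunsPerWeek: number,
  fullRunsPerWeek: number,
  costPerConversation: number,
  smokeSize = 30,
  fullSize = 480
): number {
  return smokeRunsPerWeek * smokeSize * costPerConversation
       + fullRunsPerWeek * fullSize * costPerConversation;
}

// weeklyCost(10, 2, 0.02) ≈ $25.20; weeklyCost(10, 2, 0.08) ≈ $100.80
```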

What changes warrant a full simulation run?

Not every code change needs the full digital twin. Use this as a heuristic:

| Change Type | Simulation Level | Why |
| --- | --- | --- |
| System prompt rewrite | Full matrix + new baseline | Prompts affect everything |
| Model version upgrade | Full matrix | Model behavior is unpredictable |
| New tool added | Relevant scenarios only | New capability might introduce new failure modes |
| Knowledge base update | Affected scenario categories | New docs might conflict with existing knowledge |
| Tool configuration change | Affected scenarios only | Changed tool behavior might break workflows |
| UI-only changes | Skip simulation | Can't affect agent behavior |
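That heuristic can be encoded directly in the CI trigger. A sketch — the path patterns (`prompts/`, `tools/`, and so on) are assumptions about repo layout, and anything unclassified falls back to the cheap smoke gate:

```typescript
// Sketch: map changed file paths to a simulation level, mirroring the
// heuristics above. Path patterns are assumptions — adjust to your repo.
type SimulationLevel = 'full' | 'affected' | 'smoke' | 'skip';

function simulationLevelFor(changedFiles: string[]): SimulationLevel {
  const matches = (re: RegExp) => changedFiles.some(f => re.test(f));

  if (matches(/prompts\//) || matches(/model-config/)) return 'full';      // prompts, model
  if (matches(/tools\//) || matches(/knowledge\//)) return 'affected';     // tools, KB
  if (changedFiles.every(f => /^(ui|docs)\//.test(f))) return 'skip';      // UI-only
  return 'smoke';  // default: cheap gate for everything else
}
```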

What does a production-ready digital twin framework look like?

A complete framework ties together persona generation, scenario orchestration, scoring, regression detection, and reporting into a single run command. Here's the orchestration layer that connects all the pieces we've built:

typescript
interface DigitalTwinConfig {
  agentEndpoint: string;
  scenarios: SimulationScenario[];
  personas: PersonaProfile[];
  scorecard: SimulationScorecard;
  baseline?: SimulationBaseline;
  concurrency: number;
  maxTurnsPerConversation: number;
  regressionThresholds: {
    failDelta: number;
    warnDelta: number;
    minPassRate: number;
  };
}
 
interface DigitalTwinReport {
  summary: {
    totalSimulations: number;
    passRate: number;
    overallMean: number;
    duration: number;
  };
  regression: RegressionReport | null;
  byCategory: Record<string, { mean: number; passRate: number; count: number }>;
  byPersonaType: Record<string, { mean: number; passRate: number; count: number }>;
  failures: Array<{
    simulationId: string;
    scenarioName: string;
    personaName: string;
    score: number;
    failedChecks: string[];
    notes: string;
  }>;
}
 
async function runDigitalTwin(config: DigitalTwinConfig): Promise<DigitalTwinReport> {
  const startTime = Date.now();
 
  // 1. Run all simulations
  console.log('Starting simulation suite...');
  const results = await runSimulationSuite(
    config.scenarios,
    config.personas,
    config.agentEndpoint,
    {
      concurrency: config.concurrency,
      maxTurns: config.maxTurnsPerConversation,
      turnTimeout: 30_000,
    }
  );
 
  // 2. Score all conversations (3x each for reliability)
  console.log(`Scoring ${results.length} conversations...`);
  const scores = new Map<string, ScorecardResult>();
 
  for (let i = 0; i < results.length; i += config.concurrency) {
    const batch = results.slice(i, i + config.concurrency);
    const batchScores = await Promise.all(
      batch.map(async (result) => {
        const scenario = config.scenarios.find(s => s.id === result.scenarioId)!;
 
        // Run scoring 3 times, take median
        const runs = await Promise.all(
          Array.from({ length: 3 }, () =>
            scoreSimulation(result, scenario, config.scorecard)
          )
        );
        const sorted = runs.sort((a, b) => a.weightedAverage - b.weightedAverage);
        return { id: result.id, score: sorted[1] };  // median
      })
    );
 
    for (const { id, score } of batchScores) {
      scores.set(id, score);
    }
  }
 
  // 3. Run programmatic checks
  const failures: DigitalTwinReport['failures'] = [];
  for (const result of results) {
    const score = scores.get(result.id);
    const scenario = config.scenarios.find(s => s.id === result.scenarioId)!;
    const persona = config.personas.find(p => p.id === result.personaId)!;
 
    const failedChecks = config.scorecard.programmaticChecks
      .map(check => check.check(result, scenario))
      .filter(r => !r.passed)
      .map(r => r.detail);
 
    if (!score?.passed || failedChecks.length > 0) {
      failures.push({
        simulationId: result.id,
        scenarioName: scenario.name,
        personaName: persona.name,
        score: score?.weightedAverage ?? 0,
        failedChecks,
        notes: score?.notes ?? '',
      });
    }
  }
 
  // 4. Regression detection
  const currentBaseline = buildBaseline(
    results, scores, config.scenarios, config.personas, 'current'
  );
 
  const regression = config.baseline
    ? detectSimulationRegressions(config.baseline, currentBaseline, config.regressionThresholds)
    : null;
 
  // 5. Build report
  const allScores = [...scores.values()].map(s => s.weightedAverage);
 
  return {
    summary: {
      totalSimulations: results.length,
      passRate: allScores.filter(s => s >= config.scorecard.passingThreshold).length / allScores.length,
      overallMean: mean(allScores),
      duration: Date.now() - startTime,
    },
    regression,
    byCategory: currentBaseline.aggregates.byCategory,
    byPersonaType: currentBaseline.aggregates.byPersona,
    failures: failures.sort((a, b) => a.score - b.score),
  };
}

This is the full loop. A developer changes a prompt. CI triggers. The digital twin spins up hundreds of synthetic customers. Each conversation runs against the real agent configuration. The scoring pipeline grades every exchange. The regression detector compares against last week's baseline. The PR gets a green check, yellow warning, or red block — with a detailed breakdown of which persona types and scenario categories degraded.

What does the implementation roadmap look like?

You don't need to build the full framework on day one. Here's a progression that matches investment to maturity:

Week 1-2: Manual simulation. Write 10-15 persona profiles by hand. Define 5 core scenarios. Run simulations manually using the OpenAI playground or a simple script. Score conversations by reading them. This alone will surface failures you haven't seen.

Week 3-4: Automated scoring. Build the scoring pipeline — LLM judge with weighted rubrics plus programmatic checks. Now you can run personas against your agent and get quantified results instead of subjective impressions.

Month 2: Persona generation and scale. Implement the persona matrix generator. Scale from 15 hand-crafted personas to 50+ generated ones. Add pivot point injections to scenarios. You're now testing systematically against customer diversity.

Month 3: CI/CD integration and regression tracking. Wire simulations into your PR pipeline. Build baselines. Detect regressions automatically. At this point, no prompt change ships without simulation coverage.

Ongoing: Production feedback loop. Mine production analytics for new scenario types and persona behaviors your simulation doesn't yet cover. Use conversation monitoring data to keep your simulation realistic as customer behavior evolves.

The teams getting the most value from digital twins aren't the ones with the most sophisticated frameworks. They're the ones who started simple and iterated. Five good personas and a scoring rubric beat an unused thousand-persona system every time.

Key takeaways

Digital twin testing isn't magic. It's disciplined simulation — building the customers your agent will face, running conversations at scale, measuring quality with rubrics, and catching regressions before they reach production. The tooling matters less than the methodology.

Here's what separates teams that catch failures pre-production from teams that discover them in customer complaints:

  1. Personas mapped to failure categories. Every persona probes a specific failure mode. Random personas generate random conversations. Targeted personas generate actionable data.

  2. Scoring with ground truth. Your LLM judge needs the right answers, not just a rubric. Without ground truth, the judge can't distinguish confident hallucination from accurate responses.

  3. Regression tracking over time. A single simulation run is a snapshot. Regression tracking across runs is what tells you whether your agent is getting better or worse — and where.

  4. CI/CD integration. If simulation results don't block deploys, they don't matter. The quality gate is what turns testing from a nice-to-have into a system guarantee.

  5. Production feedback loop. Your simulation is only as good as its scenarios and personas. Continuously update both from real conversation data.

For the scoring methodology deep-dive — LLM-as-judge calibration, rubric anchoring, multi-criteria evaluation — see How to Evaluate AI Agents. For the testing workflow that wraps around scoring — scenario design, edge case generation, regression suites — see AI Agent Testing: How to Evaluate Agents Before Production. For monitoring your agent after it passes the digital twin gate, the patterns in AI Agent Observability connect directly to the metrics we've tracked here.

If building the simulation infrastructure from scratch isn't where you want to spend your engineering time, Chanl's scenario testing and scorecard evaluation handle the orchestration — persona management, parallel simulation, automated scoring, and regression tracking out of the box.

Build the twin. Test at scale. Ship with confidence.

