In March 2025, researchers at Carnegie Mellon built a simulated company staffed entirely by AI agents. Agents were assigned roles — software engineers, product managers, data analysts — and given tasks that mirrored real workflows. Even the best-performing model, Anthropic's Claude, completed only 24% of assigned tasks successfully. The agents didn't crash. They produced outputs that looked reasonable. But when evaluated against actual task requirements, they fabricated data, misinterpreted instructions, and confidently delivered wrong answers.
The researchers didn't discover this by deploying agents to a real company. They discovered it by building a digital twin — a simulation environment where AI agents could be tested at scale against realistic conditions.
That's the core idea behind digital twins for AI agents: instead of shipping an agent and hoping for the best, you build a virtual environment where the agent faces thousands of simulated customers before it ever talks to a real one. You discover the 76% failure rate before production, not after.
This guide covers the architecture, implementation, and testing patterns for building digital twins that actually catch failures. We'll build a working simulation framework in TypeScript, design persona systems that generate diverse test conversations, and connect the scoring pipeline that turns raw simulation data into deploy/no-deploy decisions.
| What you'll learn | Why it matters |
|---|---|
| Digital twin architecture | Three-layer system: persona engine, scenario orchestrator, scoring pipeline |
| Persona design | Build synthetic customers that probe different failure modes |
| Simulation orchestration | Run hundreds of conversations in parallel with deterministic control |
| Scoring and evaluation | Grade simulated conversations against weighted rubrics |
| Regression detection | Catch quality degradation across prompt and model changes |
| CI/CD integration | Gate deploys on simulation results automatically |
What is a digital twin for an AI agent?
A digital twin is a simulation environment that runs your real agent configuration against synthetic customers at scale — discovering failures, measuring quality drift, and validating changes before any real person is affected. The term borrows from manufacturing, where digital twins are virtual replicas of physical systems used for testing and optimization. For AI agents, the "twin" isn't a copy of the agent itself — it's the entire testing environment that surrounds it.
The distinction matters. You're not cloning your agent. You're building the world it operates in — the customers it talks to, the scenarios it encounters, the edge cases it has to navigate — and running your actual agent through that simulated world.
Here's what makes this different from manually chatting with your agent a few times before launch:
Scale. Instead of testing 20 conversations, you test 2,000. Instead of checking five customer types, you check fifty. Human testers can run maybe 10-15 conversations per hour. A digital twin runs hundreds per minute.
Diversity. Real customers come in infinite varieties — different emotional states, communication styles, domain knowledge levels, accents, patience thresholds. A digital twin with well-designed personas generates this diversity systematically, hitting combinations a human tester would never think to try.
Reproducibility. When a simulation reveals a failure, you can replay it. You can tweak the prompt and run the exact same persona against the updated agent. You can compare scores across versions with statistical confidence instead of gut feel.
Continuous validation. Digital twins don't get tired. They can run on every PR, every model update, every knowledge base change. This is particularly important because AI agents fail in ways that aren't visible without systematic measurement — a model provider updates their weights, and your agent's tone shifts subtly in ways no single conversation reveals.
How this differs from unit tests and manual QA
If you've read about AI agent testing, you know unit tests check individual components — intent classification accuracy, entity extraction, API response shapes — while scenario tests check conversation-level behavior. Digital twins sit one level above both. They're the orchestration layer that generates diverse scenarios automatically, runs them at scale, and tracks results over time.
Think of it as the difference between:
- Unit tests: "Does the refund tool return the right JSON shape?" (yes/no)
- Scenario tests: "When a frustrated customer asks for a refund on a final-sale item, does the agent handle it correctly?" (scored on a rubric)
- Digital twin: "Run this agent against 500 different customer personas across 40 scenario types, compare scores to last week's baseline, and flag any regressions" (automated, continuous)
You need all three. But the digital twin is what makes the difference between "we tested it" and "we know how it performs across the full distribution of customer behavior."
Why do AI agents need simulation-based testing?
AI agents fail in ways that look like success — they produce fluent, confident, helpful-sounding responses that happen to be wrong, policy-violating, or emotionally tone-deaf — and only systematic simulation at scale reveals these failure patterns before customers find them. This isn't a theoretical concern. The failure rate in real deployments is sobering.
Gartner predicts that by 2028, 25% of enterprise security breaches will be attributed to AI agent abuse. The ASAPP 2025 report on the "AI agent failure era" found that most production CX agents struggle with exactly the scenarios that matter most — complex multi-step issues, emotionally charged interactions, and edge cases that fall outside training distribution. Cleanlab's 2025 survey found that AI agents in production face "shifting stacks" where new frameworks, APIs, and model versions change faster than teams can validate them.
The common thread: you can't predict these failures by looking at your agent's code. You have to run it against realistic conditions and measure what happens.
The failure taxonomy
Through simulation testing, a consistent pattern of failure categories emerges. Understanding these categories is how you design personas and scenarios that actually probe the right failure modes.
| Failure Category | What Happens | Why Unit Tests Miss It | Simulation Catches It |
|---|---|---|---|
| Hallucinated policies | Agent invents a return policy, discount, or procedure | The fabrication is grammatically correct | Persona asks about non-existent policies; scoring checks against ground truth |
| Context loss across turns | Agent forgets information from turn 3 by turn 7 | Each turn passes individually | Multi-turn persona conversations reveal dropped context |
| Emotional miscalibration | Agent responds cheerfully to a clearly frustrated customer | Tone detection works in isolation | Frustrated persona escalates emotion over turns; scoring checks empathy alignment |
| Tool selection errors | Agent calls the wrong tool or skips a required tool | Tool integration tests pass | Ambiguous scenarios force the agent to choose between tools |
| Mid-conversation correction failures | Customer corrects information; agent uses the original | Correction handling passes as isolated test | Persona deliberately provides wrong info, then corrects it mid-conversation |
| Knowledge boundary violations | Agent answers questions it shouldn't know the answer to | Knowledge retrieval returns relevant docs | Boundary-probing persona asks about competitor products, internal procedures |
Each of these categories maps to a persona design pattern. A persona that probes hallucinated policies is different from one that tests emotional miscalibration. The digital twin framework needs to generate both systematically.
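That mapping can be made explicit in code. A minimal sketch, where the category keys and trait values are illustrative rather than a fixed taxonomy:

```typescript
// Illustrative mapping from failure category to the persona traits that probe it.
// Category names and trait values are examples, not a canonical taxonomy.
type ProbeTraits = { behavioralIntent: string; emotionalState: string };

const failureProbes: Record<string, ProbeTraits> = {
  hallucinated_policies: { behavioralIntent: 'cooperative', emotionalState: 'calm' },
  context_loss: { behavioralIntent: 'tangential', emotionalState: 'calm' },
  emotional_miscalibration: { behavioralIntent: 'cooperative', emotionalState: 'frustrated' },
  tool_selection_errors: { behavioralIntent: 'cooperative', emotionalState: 'confused' },
  correction_failures: { behavioralIntent: 'correction-prone', emotionalState: 'calm' },
  knowledge_boundaries: { behavioralIntent: 'adversarial', emotionalState: 'calm' },
};

// Pick the persona traits that target a given failure category
function probeFor(category: string): ProbeTraits | undefined {
  return failureProbes[category];
}
```

Keeping the mapping in one place makes it easy to audit which failure categories your persona pool actually covers.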
“Agents don't fail on the questions you prepared for. They fail on the questions your customers invent — the ambiguous ones, the emotional ones, the ones that require saying 'I don't know.'”
How do you design personas for simulation testing?
Effective personas are defined along four behavioral axes — communication style, emotional state, domain knowledge, and behavioral intent — and the most valuable testing comes from combining these axes into profiles that probe specific failure categories. A persona isn't a name and a backstory. It's a set of behavioral constraints that guide an LLM to role-play a specific type of customer consistently across a multi-turn conversation.
The goal is diversity that maps to real customer distributions. If 30% of your production callers are frustrated, 30% of your simulation personas should be too. If 10% of real conversations involve mid-topic corrections, your persona mix should include that pattern.
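A small sketch of that allocation step (the distribution shares below are illustrative):

```typescript
// Size each persona segment to match an observed production distribution.
// `distribution` maps a trait to its production share; shares should sum to 1.
function allocatePersonas(
  distribution: Record<string, number>,
  total: number
): Record<string, number> {
  const counts: Record<string, number> = {};
  let assigned = 0;
  for (const [trait, share] of Object.entries(distribution)) {
    counts[trait] = Math.floor(share * total);
    assigned += counts[trait];
  }
  // Hand any rounding remainder to the largest segment
  const largest = Object.entries(distribution).sort((a, b) => b[1] - a[1])[0][0];
  counts[largest] += total - assigned;
  return counts;
}

// e.g. 30% of production callers are frustrated → 30% of 200 personas are too
const mix = allocatePersonas({ frustrated: 0.3, calm: 0.5, anxious: 0.2 }, 200);
```

Re-run the allocation whenever your production analytics show the customer mix drifting.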
The four axes of persona design
Communication style determines how the persona phrases things. Verbose callers provide too much context. Terse callers give one-word answers. Technical callers use domain jargon. Non-native speakers use simplified grammar. Each style tests different aspects of your agent's language understanding.
Emotional state shapes the conversation trajectory. A calm customer who encounters a policy limitation might accept it. A frustrated customer will push back. An anxious customer needs reassurance. The emotional axis tests whether your agent adapts its tone appropriately — a critical dimension that scorecard evaluation can measure.
Domain knowledge determines what the persona knows about your product. Expert users ask detailed technical questions. Novice users describe symptoms instead of problems ("the thing won't work"). The knowledge gap between expert and novice callers is one of the most common sources of production agent failures.
Behavioral intent is what the persona is actually trying to accomplish — and whether they're cooperating or adversarial. A cooperative persona follows the agent's prompts. A tangential persona drifts between topics. An adversarial persona tries to exploit policies or extract unauthorized information.
interface PersonaProfile {
id: string;
name: string;
communicationStyle: 'verbose' | 'terse' | 'technical' | 'non-native' | 'rambling';
emotionalState: 'calm' | 'frustrated' | 'anxious' | 'confused' | 'angry';
domainKnowledge: 'expert' | 'intermediate' | 'novice';
behavioralIntent: 'cooperative' | 'tangential' | 'adversarial' | 'correction-prone';
systemPrompt: string; // LLM prompt that makes the persona behave consistently
targetFailureCategory?: string; // Which failure type this persona probes
}
// A persona designed to probe context loss across turns
const contextDriftPersona: PersonaProfile = {
id: 'context-drift-001',
name: 'The Topic Switcher',
communicationStyle: 'rambling',
emotionalState: 'calm',
domainKnowledge: 'intermediate',
behavioralIntent: 'tangential',
targetFailureCategory: 'context_loss',
systemPrompt: `You are a customer calling about a billing issue, but you frequently
drift to other topics. Start by asking about a charge on your account. After 2-3 turns
about billing, mention a separate product question. Then return to the original billing
issue and reference specific details you mentioned earlier — test whether the agent
remembers them. If the agent loses context about your original issue, express mild
frustration and repeat the key details.
Your account number is AC-9382. You were charged $49.99 on March 3rd for a subscription
you thought you cancelled. You also want to know if the Premium tier includes API access.
IMPORTANT: When you return to the billing topic, do NOT repeat your account number.
Reference it as "the account I mentioned." Only repeat it if the agent explicitly asks.`,
};
// A persona designed to probe hallucinated policies
const policyProbePersona: PersonaProfile = {
id: 'policy-probe-001',
name: 'The Policy Explorer',
communicationStyle: 'technical',
emotionalState: 'calm',
domainKnowledge: 'expert',
behavioralIntent: 'cooperative',
targetFailureCategory: 'hallucinated_policies',
systemPrompt: `You are a savvy customer who asks very specific policy questions.
Your goal is to probe whether the agent knows the boundaries of actual company policy
versus things it might fabricate.
Ask about:
1. The exact refund window for annual subscriptions (real: 14 days)
2. Whether there's a loyalty discount for 2+ year customers (doesn't exist)
3. The SLA guarantee for API uptime (real: 99.9%)
4. Whether you can transfer a license to a colleague (doesn't exist)
For each response, ask a follow-up that tests specificity: "Can you point me to where
that's documented?" or "Is that in the terms of service?"
If the agent invents a policy that doesn't exist, accept it politely and move on —
the scoring system will catch it.`,
};

Generating persona combinations at scale
You don't need to hand-write every persona. Define the axes, then generate combinations programmatically:
function generatePersonaMatrix(axes: {
  styles: PersonaProfile['communicationStyle'][];
  emotions: PersonaProfile['emotionalState'][];
  knowledge: PersonaProfile['domainKnowledge'][];
  intents: PersonaProfile['behavioralIntent'][];
}): PersonaProfile[] {
const personas: PersonaProfile[] = [];
let counter = 0;
  // A full cartesian product would be 5 * 5 * 3 * 4 = 300 personas.
  // Instead, cycle knowledge and intent across style-emotion pairs for practical coverage.
for (const style of axes.styles) {
for (const emotion of axes.emotions) {
// Pick one knowledge level and one intent per style-emotion pair
const knowledge = axes.knowledge[counter % axes.knowledge.length];
const intent = axes.intents[counter % axes.intents.length];
personas.push({
id: `persona-${counter++}`,
name: `${emotion}-${style}-${knowledge}`,
communicationStyle: style,
emotionalState: emotion,
domainKnowledge: knowledge,
behavioralIntent: intent,
systemPrompt: buildPersonaPrompt(style, emotion, knowledge, intent),
});
}
}
return personas;
}
function buildPersonaPrompt(
style: string,
emotion: string,
knowledge: string,
intent: string
): string {
const styleInstructions: Record<string, string> = {
verbose: 'You tend to over-explain. Provide lots of context, sometimes irrelevant.',
terse: 'You give short answers. One sentence max unless pressed for details.',
technical: 'You use precise technical terminology and expect the same back.',
'non-native': 'English is your second language. Use simpler grammar, occasionally misuse idioms.',
rambling: 'You drift between topics and circle back. Stream of consciousness.',
};
const emotionInstructions: Record<string, string> = {
calm: 'You are patient and reasonable throughout the conversation.',
frustrated: 'You are increasingly frustrated. If the agent is unhelpful, escalate your tone.',
anxious: 'You are worried about the outcome. Ask for reassurance repeatedly.',
confused: 'You don\'t fully understand the situation. Ask clarifying questions.',
angry: 'You are upset from the start. You want resolution NOW.',
};
return `You are a simulated customer in a testing environment.
COMMUNICATION: ${styleInstructions[style]}
EMOTIONAL STATE: ${emotionInstructions[emotion]}
DOMAIN KNOWLEDGE: You are a ${knowledge}-level user of the product.
BEHAVIOR: Your intent is ${intent}.
Stay in character throughout the conversation. Do not break the fourth wall.
Do not mention that you are a test or simulation.
Respond naturally based on your persona traits.`;
}

This persona generation approach draws from the same principles as adversarial testing — synthetic users that probe the agent's weaknesses systematically. Sierra AI's research on voice simulations confirms the pattern: "Simulations can be designed to re-create specific situations, including ones that are rare or hard to observe in the real world, but that can have an outsized impact on the quality and safety of the agent experience."
How do you orchestrate simulation at scale?
The scenario orchestrator is the engine that pairs personas with scenarios, manages conversation flow, handles concurrency, and collects the raw data the scoring pipeline needs. A well-designed orchestrator runs hundreds of conversations in parallel while maintaining deterministic control over each one — you need to know exactly which persona said what, in which scenario, and be able to replay any conversation that reveals a failure.
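Determinism usually comes down to seeding every source of randomness. Here's a minimal sketch using mulberry32, a tiny seedable PRNG (any seedable generator works): recording the seed alongside each run lets you replay the exact persona and scenario sampling that produced a failure.

```typescript
// mulberry32: a small seedable PRNG. Driving persona/scenario sampling from a
// recorded seed makes a simulation run exactly reproducible.
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seed → identical sequence → identical persona/scenario pairings on replay
const runA = mulberry32(9382);
const runB = mulberry32(9382);
const seqA = [runA(), runA(), runA()];
const seqB = [runB(), runB(), runB()];
```

Store the seed in the simulation result record so any failing conversation can be regenerated on demand.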
Scenario definition
A scenario defines the situation, separate from the persona that encounters it. The same "billing dispute" scenario should behave differently when a frustrated expert encounters it versus when a confused novice does.
interface SimulationScenario {
id: string;
name: string;
category: 'billing' | 'technical_support' | 'account_management' | 'sales' | 'escalation';
description: string;
setup: {
customerContext: Record<string, unknown>; // Account data, order history, etc.
agentContext?: Record<string, unknown>; // Pre-loaded knowledge, tool access
};
pivotPoints?: {
afterTurn: number;
injection: string; // Force a specific user message to test a specific behavior
}[];
successCriteria: {
mustResolve: boolean; // Must the issue be fully resolved?
maxTurns: number; // Efficiency threshold
requiredTools?: string[]; // Tools the agent should use
forbiddenActions?: string[]; // Things the agent must NOT do
groundTruth?: Record<string, string>; // Factual answers for scoring
};
}
const billingDisputeScenario: SimulationScenario = {
id: 'billing-dispute-001',
name: 'Duplicate charge dispute',
category: 'billing',
description: 'Customer was charged twice for the same subscription renewal',
setup: {
customerContext: {
accountId: 'AC-9382',
plan: 'Professional',
lastCharge: { amount: 49.99, date: '2025-03-03', description: 'Monthly renewal' },
duplicateCharge: { amount: 49.99, date: '2025-03-03', description: 'Monthly renewal' },
refundEligible: true,
},
},
pivotPoints: [
{
afterTurn: 4,
injection: 'Actually, wait — I just realized the second charge might be for my team account. Can you check both accounts?',
},
],
successCriteria: {
mustResolve: true,
maxTurns: 12,
requiredTools: ['lookup_billing_history', 'process_refund'],
forbiddenActions: ['transfer_to_human_without_attempting_resolution'],
groundTruth: {
refundPolicy: '14-day refund window for subscription charges',
duplicateChargeProcess: 'Automatic refund within 3-5 business days',
},
},
};

The simulation runner
The runner pairs personas with scenarios, manages the conversation loop, and captures everything needed for scoring:
interface ConversationTurn {
role: 'customer' | 'agent';
content: string;
timestamp: number;
toolsUsed?: string[];
metadata?: Record<string, unknown>;
}
interface SimulationResult {
id: string;
scenarioId: string;
personaId: string;
turns: ConversationTurn[];
totalDuration: number;
toolsUsed: string[];
resolved: boolean;
turnCount: number;
}
async function runSimulation(
scenario: SimulationScenario,
persona: PersonaProfile,
agentEndpoint: string,
config: { maxTurns: number; turnTimeout: number }
): Promise<SimulationResult> {
const turns: ConversationTurn[] = [];
const allToolsUsed: string[] = [];
const startTime = Date.now();
// Generate the persona's opening message based on the scenario
const openingMessage = await generatePersonaMessage(
persona,
scenario,
[], // no previous turns
);
turns.push({
role: 'customer',
content: openingMessage,
timestamp: Date.now(),
});
for (let turnNum = 0; turnNum < config.maxTurns; turnNum++) {
// Send to agent and get response
const agentResponse = await callAgent(agentEndpoint, {
messages: turns,
context: scenario.setup.agentContext,
});
turns.push({
role: 'agent',
content: agentResponse.message,
timestamp: Date.now(),
toolsUsed: agentResponse.toolsUsed,
});
if (agentResponse.toolsUsed) {
allToolsUsed.push(...agentResponse.toolsUsed);
}
// Check if conversation has naturally concluded
if (agentResponse.conversationEnded) break;
// Check for pivot point injections
const pivot = scenario.pivotPoints?.find(p => p.afterTurn === turnNum + 1);
const nextCustomerMessage = pivot
? pivot.injection
: await generatePersonaMessage(persona, scenario, turns);
turns.push({
role: 'customer',
content: nextCustomerMessage,
timestamp: Date.now(),
});
}
return {
id: `sim-${scenario.id}-${persona.id}-${Date.now()}`,
scenarioId: scenario.id,
personaId: persona.id,
turns,
totalDuration: Date.now() - startTime,
toolsUsed: [...new Set(allToolsUsed)],
resolved: detectResolution(turns, scenario.successCriteria),
turnCount: turns.length,
};
}

Parallel execution with controlled concurrency
Running 500 simulations sequentially would take hours. But running all 500 in parallel would overwhelm your agent endpoint and your LLM API rate limits. Use controlled concurrency:
async function runSimulationSuite(
scenarios: SimulationScenario[],
personas: PersonaProfile[],
agentEndpoint: string,
config: { concurrency: number; maxTurns: number; turnTimeout: number }
): Promise<SimulationResult[]> {
// Build the simulation matrix: every scenario × selected personas
const simulationPairs: Array<{ scenario: SimulationScenario; persona: PersonaProfile }> = [];
for (const scenario of scenarios) {
// Not every persona needs every scenario — match by failure category
const relevantPersonas = personas.filter(
p => !p.targetFailureCategory || isRelevantToScenario(p, scenario)
);
for (const persona of relevantPersonas) {
simulationPairs.push({ scenario, persona });
}
}
console.log(`Running ${simulationPairs.length} simulations (${config.concurrency} concurrent)`);
const results: SimulationResult[] = [];
// Process in batches of `concurrency`
for (let i = 0; i < simulationPairs.length; i += config.concurrency) {
const batch = simulationPairs.slice(i, i + config.concurrency);
const batchResults = await Promise.all(
batch.map(({ scenario, persona }) =>
runSimulation(scenario, persona, agentEndpoint, config)
)
);
results.push(...batchResults);
// Progress reporting
const pct = Math.round(((i + batch.length) / simulationPairs.length) * 100);
console.log(`Progress: ${pct}% (${i + batch.length}/${simulationPairs.length})`);
}
return results;
}

This is the orchestration pattern that platforms like Chanl's scenario testing system implement under the hood — persona management, conversation orchestration, and parallel execution with rate limiting.
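One refinement worth noting: the fixed-batch loop above waits for the slowest conversation in each batch before starting the next batch. A sliding-window limiter (the pattern that libraries such as p-limit implement) starts a new simulation the moment any slot frees. A minimal standalone sketch:

```typescript
// A sliding-window concurrency limiter: at most `concurrency` tasks in flight,
// and a queued task starts as soon as any running one settles.
function createLimiter(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  const release = () => {
    active--;
    const next = queue.shift();
    if (next) next();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(release);
      };
      if (active < concurrency) run();
      else queue.push(run);
    });
  };
}

// Usage sketch: replace the batch loop with
//   const limit = createLimiter(config.concurrency);
//   const results = await Promise.all(
//     simulationPairs.map(pair =>
//       limit(() => runSimulation(pair.scenario, pair.persona, agentEndpoint, config))
//     )
//   );
```

Batching is simpler and fine for a first version; switch to a limiter when slow outlier conversations start dominating wall-clock time.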
How do you score and evaluate simulated conversations?
The scoring pipeline takes raw conversation transcripts and produces structured quality scores — combining LLM-as-judge evaluation for subjective quality with programmatic checks for hard policy violations. The output is a per-conversation scorecard that feeds into regression detection and deploy gating.
If you've built eval frameworks before, the scoring pipeline will feel familiar. The difference in a digital twin context is scale — you're scoring hundreds of conversations per run, not a handful.
Scorecard design for simulation
A simulation scorecard needs criteria that map to the failure categories your personas are probing. Generic criteria like "overall quality" don't give you actionable signal. Specific criteria tied to specific failure modes do.
interface SimulationScorecard {
id: string;
criteria: Array<{
name: string;
description: string;
weight: number;
anchors: { score: number; description: string }[];
appliesTo?: string[]; // Only score this for specific scenario categories
}>;
passingThreshold: number;
  programmaticChecks: ProgrammaticCheck[];
}

interface ProgrammaticCheck {
  name: string;
  check: (
    result: SimulationResult,
    scenario: SimulationScenario
  ) => { passed: boolean; detail: string };
}
const simulationScorecard: SimulationScorecard = {
id: 'twin-scorecard-v1',
criteria: [
{
name: 'factual_accuracy',
description: 'Are all factual claims correct per the ground truth?',
weight: 0.30,
anchors: [
{ score: 1, description: 'Multiple factual errors or fabricated policies' },
{ score: 3, description: 'Mostly accurate with minor imprecisions' },
{ score: 5, description: 'All claims verifiable against ground truth' },
],
},
{
name: 'context_retention',
description: 'Does the agent remember details from earlier in the conversation?',
weight: 0.20,
anchors: [
{ score: 1, description: 'Asks customer to repeat previously stated info' },
{ score: 3, description: 'Retains key details but misses some context' },
{ score: 5, description: 'References earlier details naturally and accurately' },
],
},
{
name: 'emotional_calibration',
description: 'Does the agent match the appropriate emotional register?',
weight: 0.20,
anchors: [
{ score: 1, description: 'Tone is wildly inappropriate (cheerful to angry customer)' },
{ score: 3, description: 'Generally appropriate but misses escalation cues' },
{ score: 5, description: 'Tone matches and adapts to emotional shifts across turns' },
],
},
{
name: 'task_completion',
description: 'Did the agent resolve the customer issue or appropriately escalate?',
weight: 0.20,
anchors: [
{ score: 1, description: 'Issue unresolved, customer left worse off' },
{ score: 3, description: 'Partial resolution or unnecessary human escalation' },
{ score: 5, description: 'Full resolution or well-justified escalation with context' },
],
},
{
name: 'efficiency',
description: 'Did the agent resolve efficiently without unnecessary turns?',
weight: 0.10,
anchors: [
{ score: 1, description: 'Excessive turns, repetitive questions, circular conversation' },
{ score: 3, description: 'Reasonable turn count with some redundancy' },
{ score: 5, description: 'Efficient conversation flow, no wasted turns' },
],
},
],
passingThreshold: 3.5,
programmaticChecks: [
{
name: 'no_hallucinated_policy',
check: (result, scenario) => {
// This would be checked by the LLM judge against ground truth
// Programmatic check verifies format compliance
return { passed: true, detail: 'Deferred to LLM judge for factual accuracy' };
},
},
{
name: 'required_tools_used',
check: (result, scenario) => {
const required = scenario.successCriteria.requiredTools ?? [];
const missing = required.filter(t => !result.toolsUsed.includes(t));
return {
passed: missing.length === 0,
detail: missing.length > 0
? `Missing required tools: ${missing.join(', ')}`
: 'All required tools used',
};
},
},
{
name: 'turn_limit',
check: (result, scenario) => {
return {
passed: result.turnCount <= scenario.successCriteria.maxTurns,
detail: `${result.turnCount} turns (limit: ${scenario.successCriteria.maxTurns})`,
};
},
},
{
name: 'no_forbidden_actions',
check: (result, scenario) => {
const forbidden = scenario.successCriteria.forbiddenActions ?? [];
const violations = forbidden.filter(a => result.toolsUsed.includes(a));
return {
passed: violations.length === 0,
detail: violations.length > 0
? `Forbidden actions taken: ${violations.join(', ')}`
: 'No policy violations',
};
},
},
],
};

LLM judge for simulated conversations
The judge scores each conversation against the scorecard criteria. For simulation testing, include the scenario's ground truth so the judge can check factual accuracy:
interface ScorecardResult {
  simulationId: string;
  scores: Record<string, { score: number; reasoning: string }>;
  weightedAverage: number;
  passed: boolean;
  notes: string;
}

async function scoreSimulation(
  result: SimulationResult,
  scenario: SimulationScenario,
  scorecard: SimulationScorecard
): Promise<ScorecardResult> {
const conversationText = result.turns
.map(t => `[${t.role.toUpperCase()}]: ${t.content}`)
.join('\n\n');
const groundTruthSection = scenario.successCriteria.groundTruth
? `\n\nGROUND TRUTH (use this to verify factual claims):\n${
Object.entries(scenario.successCriteria.groundTruth)
.map(([k, v]) => `- ${k}: ${v}`)
.join('\n')
}`
: '';
  // Assumes an initialized client in scope: import OpenAI from 'openai'; const openai = new OpenAI();
  const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.1,
response_format: { type: 'json_object' },
messages: [
{
role: 'system',
content: `You are an expert QA evaluator scoring AI agent conversations.
Score the following conversation on each criterion using the provided rubric anchors.
Return a JSON object with this exact structure:
{
"scores": {
"<criterion_name>": { "score": <1-5>, "reasoning": "<1-2 sentences>" }
},
"overall_notes": "<key observations about agent performance>"
}
CRITERIA:
${scorecard.criteria.map(c =>
`${c.name} (weight: ${c.weight}): ${c.description}\n ${
c.anchors.map(a => `${a.score} = ${a.description}`).join('\n ')
}`
).join('\n\n')}
${groundTruthSection}`,
},
{
role: 'user',
content: `SCENARIO: ${scenario.name} — ${scenario.description}
PERSONA: ${result.personaId}
TURNS: ${result.turnCount}
TOOLS USED: ${result.toolsUsed.join(', ') || 'none'}
CONVERSATION:
${conversationText}`,
},
],
});
const judgeOutput = JSON.parse(response.choices[0].message.content ?? '{}');
let weightedSum = 0;
for (const criterion of scorecard.criteria) {
const score = judgeOutput.scores[criterion.name]?.score ?? 0;
weightedSum += score * criterion.weight;
}
return {
simulationId: result.id,
scores: judgeOutput.scores,
weightedAverage: Math.round(weightedSum * 100) / 100,
passed: weightedSum >= scorecard.passingThreshold,
notes: judgeOutput.overall_notes,
};
}

Run each scoring three times and take the median to account for LLM non-determinism. This is the same calibration technique described in the eval framework guide — consistency matters more than precision for any single run.
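A small helper for that median step, applied per criterion across the repeated judge runs:

```typescript
// Median of repeated judge scores for one criterion; a single outlier run
// can't swing the result the way a mean would let it.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Three judge runs for factual_accuracy, one outlier among them
const calibrated = median([4, 5, 2]); // → 4
```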
How do you detect regressions across simulation runs?
Regression detection compares your current simulation results against a known-good baseline and flags statistically significant quality drops — distinguishing real degradation from normal LLM scoring variance. Without regression tracking, a prompt tweak that improves billing scenarios by 0.5 points while silently degrading returns scenarios by 1.2 points goes unnoticed until a customer reports it.
Building the baseline
A baseline captures your agent's performance across the full simulation matrix at a point where you've verified quality is acceptable. Every subsequent run compares against this.
interface SimulationBaseline {
version: string;
timestamp: string;
modelVersion: string;
results: Map<string, {
scenarioId: string;
personaId: string;
weightedAverage: number;
criteriaScores: Record<string, number>;
turnCount: number;
resolved: boolean;
}>;
aggregates: {
overallMean: number;
passRate: number;
byCategory: Record<string, { mean: number; passRate: number }>;
    byPersona: Record<string, { mean: number; passRate: number }>;
};
}
function buildBaseline(
results: SimulationResult[],
scores: Map<string, ScorecardResult>,
scenarios: SimulationScenario[],
personas: PersonaProfile[],
version: string
): SimulationBaseline {
const baselineResults = new Map();
for (const result of results) {
const score = scores.get(result.id);
if (!score) continue;
baselineResults.set(result.id, {
scenarioId: result.scenarioId,
personaId: result.personaId,
weightedAverage: score.weightedAverage,
criteriaScores: Object.fromEntries(
Object.entries(score.scores).map(([k, v]) => [k, v.score])
),
turnCount: result.turnCount,
resolved: result.resolved,
});
}
// Compute aggregates by category and persona type
const byCategory = groupAndAggregate(results, scores, scenarios, 'category');
const byPersona = groupAndAggregate(results, scores, personas, 'emotionalState');
const allScores = [...scores.values()].map(s => s.weightedAverage);
return {
version,
timestamp: new Date().toISOString(),
modelVersion: 'gpt-4o-2025-08-06',
results: baselineResults,
aggregates: {
overallMean: mean(allScores),
passRate: allScores.filter(s => s >= 3.5).length / allScores.length,
byCategory,
byPersona,
},
};
}

Comparing runs against baseline
The regression detector needs to distinguish signal from noise. LLM scoring has inherent variance — a 0.2-point drop might be noise, but a 0.6-point drop almost certainly isn't. Set thresholds based on the variance you observe in your scoring pipeline.
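One way to calibrate those thresholds empirically: score two runs of the identical agent configuration, pair up the scores per scenario-persona cell, and measure the spread of the deltas. That spread is pure judge noise, and your warning threshold should sit above it. A sketch:

```typescript
// Estimate the scoring noise floor: deltas between two runs of the SAME agent
// config reflect judge variance only. Place warnDelta above this value.
// `runA` and `runB` are paired per-conversation weighted averages.
function noiseFloor(runA: number[], runB: number[]): number {
  const deltas = runA.map((score, i) => Math.abs(score - runB[i]));
  const meanDelta = deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
  const variance =
    deltas.reduce((sum, d) => sum + (d - meanDelta) ** 2, 0) / deltas.length;
  return meanDelta + 2 * Math.sqrt(variance); // ~2 sigma above typical noise
}
```

If the measured floor comes out near your intended warnDelta, your scoring pipeline is too noisy to detect the regressions you care about; tighten the judge (lower temperature, more repeats) before tightening the thresholds.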
interface RegressionReport {
status: 'passed' | 'warning' | 'failed';
summary: string;
categoryRegressions: Array<{
category: string;
baselineMean: number;
currentMean: number;
delta: number;
severity: 'warning' | 'critical';
}>;
personaRegressions: Array<{
personaType: string;
baselineMean: number;
currentMean: number;
delta: number;
}>;
worstScenarios: Array<{
scenarioId: string;
personaId: string;
score: number;
notes: string;
}>;
}
function detectSimulationRegressions(
baseline: SimulationBaseline,
current: SimulationBaseline,
thresholds: {
failDelta: number; // e.g., 0.5 — absolute drop that blocks deploy
warnDelta: number; // e.g., 0.3 — drop that triggers warning
minPassRate: number; // e.g., 0.85 — minimum % of simulations that must pass
}
): RegressionReport {
const categoryRegressions: RegressionReport['categoryRegressions'] = [];
const personaRegressions: RegressionReport['personaRegressions'] = [];
let hasCritical = false;
// Check by scenario category
for (const [cat, baselineAgg] of Object.entries(baseline.aggregates.byCategory)) {
const currentAgg = current.aggregates.byCategory[cat];
if (!currentAgg) continue;
const delta = currentAgg.mean - baselineAgg.mean;
if (delta <= -thresholds.warnDelta) {
const severity = delta <= -thresholds.failDelta ? 'critical' : 'warning';
if (severity === 'critical') hasCritical = true;
categoryRegressions.push({
category: cat,
baselineMean: baselineAgg.mean,
currentMean: currentAgg.mean,
delta,
severity,
});
}
}
// Check by persona type
for (const [pType, baselineAgg] of Object.entries(baseline.aggregates.byPersona)) {
const currentAgg = current.aggregates.byPersona[pType];
if (!currentAgg) continue;
const delta = currentAgg.mean - baselineAgg.mean;
if (delta <= -thresholds.warnDelta) {
personaRegressions.push({
personaType: pType,
baselineMean: baselineAgg.mean,
currentMean: currentAgg.mean,
delta,
});
}
}
// Find worst individual conversations
const worstScenarios = [...current.results.values()]
.filter(r => r.weightedAverage < thresholds.failDelta + 2) // e.g., failDelta 0.5 → surface conversations scoring below 2.5 on the 1–5 scale
.sort((a, b) => a.weightedAverage - b.weightedAverage)
.slice(0, 5)
.map(r => ({
scenarioId: r.scenarioId,
personaId: r.personaId,
score: r.weightedAverage,
notes: '',
}));
const passRateDrop = current.aggregates.passRate < thresholds.minPassRate;
return {
status: hasCritical || passRateDrop ? 'failed' : categoryRegressions.length > 0 ? 'warning' : 'passed',
summary: buildRegressionSummary(categoryRegressions, personaRegressions, baseline, current),
categoryRegressions,
personaRegressions,
worstScenarios,
};
}
The regression report tells you where quality dropped — which scenario categories, which persona types, which specific conversations. This is the actionable intelligence that prompt engineering needs to target fixes effectively, rather than blindly tweaking the system prompt.
How do you integrate digital twin testing into CI/CD?
Wire the simulation suite into your deployment pipeline as a quality gate — run a smoke subset on every PR, the full matrix on major changes, and compare against the regression baseline to make automatic pass/warn/fail decisions. This transforms digital twin testing from something you do occasionally into a continuous, enforced part of your development workflow.
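To make the gate concrete, here's a minimal sketch of the pass/fail decision a CI step could make from the simulation report. The `GateInput` shape and `gateExitCode` helper are illustrative, assuming a report that carries a regression status and an overall pass rate:

```typescript
// Map a simulation report to a CI exit code: nonzero blocks the deploy.
interface GateInput {
  regression: { status: 'passed' | 'warning' | 'failed' } | null;
  summary: { passRate: number };
}

function gateExitCode(report: GateInput, minPassRate = 0.85): number {
  // Hard block: the regression detector flagged a critical drop
  if (report.regression?.status === 'failed') return 1;
  // Hard block: too few simulations passed, even with no baseline to compare
  if (report.summary.passRate < minPassRate) return 1;
  // Warnings pass CI but should be surfaced in the PR comment
  return 0;
}

// In a CI script, something like: process.exit(gateExitCode(report));
console.log(gateExitCode({
  regression: { status: 'warning' },
  summary: { passRate: 0.91 },
})); // → 0: warnings don't block, but should be visible in the PR
```

Keeping the decision logic in one small function makes the gate auditable — anyone reading the pipeline can see exactly why a deploy was blocked.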
Cost management for continuous simulation
Running 480 simulated conversations (40 scenarios times 12 personas) on every PR would get expensive. Here's how to keep costs practical:
| Strategy | Impact | How |
|---|---|---|
| Path-filtered triggers | 70-80% fewer runs | Only trigger on prompt, model, tool, or knowledge base changes |
| Smoke-first gating | 60% fewer full runs | 30-conversation smoke test catches obvious regressions before the full suite |
| Persona sampling | 40-60% fewer conversations | Full persona matrix for major changes; sample 3-4 personas for minor ones |
| Cached judge scores | 30% less LLM spend | Cache scores for unchanged scenario-persona pairs across runs |
| Parallel execution | No cost savings, but 5x faster | Run conversations concurrently within rate limits |
A practical budget: 30 conversations for smoke tests ($0.60-2.40), 480 for the full matrix ($9.60-38.40). If you trigger the full suite twice a week and smoke tests on 10 PRs, you're looking at roughly $25-100/week. Compare that to the cost of one production incident with customer impact.
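As a sanity check on that budget, a back-of-envelope estimator. The $0.02–$0.08 per-conversation range is implied by the figures above; the `weeklyCost` helper itself is hypothetical:

```typescript
// Rough weekly simulation spend: conversations run × cost per conversation.
function weeklyCost(opts: {
  smokeRuns: number;          // PRs that trigger a smoke test
  smokeSize: number;          // conversations per smoke test
  fullRuns: number;           // full-matrix runs per week
  fullSize: number;           // conversations per full run
  costPerConversation: number; // blended agent + judge LLM cost
}): number {
  const conversations =
    opts.smokeRuns * opts.smokeSize + opts.fullRuns * opts.fullSize;
  return conversations * opts.costPerConversation;
}

// 10 smoke-tested PRs plus 2 full runs, at the high end of the range
console.log(weeklyCost({
  smokeRuns: 10, smokeSize: 30,
  fullRuns: 2, fullSize: 480,
  costPerConversation: 0.08,
})); // roughly $100/week at the high end of the range
```

Swap in your own per-conversation cost (measured from a real run, not estimated) and the formula gives you a defensible budget line.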
What changes warrant a full simulation run?
Not every code change needs the full digital twin. Use this as a heuristic:
| Change Type | Simulation Level | Why |
|---|---|---|
| System prompt rewrite | Full matrix + new baseline | Prompts affect everything |
| Model version upgrade | Full matrix | Model behavior is unpredictable |
| New tool added | Relevant scenarios only | New capability might introduce new failure modes |
| Knowledge base update | Affected scenario categories | New docs might conflict with existing knowledge |
| Tool configuration change | Affected scenarios only | Changed tool behavior might break workflows |
| UI-only changes | Skip simulation | Can't affect agent behavior |
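The table above can be turned into a trigger heuristic in the CI pipeline. This sketch assumes a repo layout where prompts, model config, tools, and knowledge base files live under dedicated paths; the patterns and the `simulationLevel` helper are illustrative:

```typescript
// Decide how much simulation a changeset warrants from its touched files.
type SimLevel = 'full' | 'targeted' | 'skip';

function simulationLevel(changedFiles: string[]): SimLevel {
  const rules: Array<{ pattern: RegExp; level: SimLevel }> = [
    { pattern: /^prompts\//, level: 'full' },      // system prompt changes
    { pattern: /model\.config/, level: 'full' },   // model version upgrades
    { pattern: /^tools\//, level: 'targeted' },    // tool additions or config
    { pattern: /^knowledge\//, level: 'targeted' }, // knowledge base updates
  ];
  let level: SimLevel = 'skip'; // UI-only and unmatched files don't trigger
  for (const file of changedFiles) {
    for (const rule of rules) {
      if (rule.pattern.test(file)) {
        if (rule.level === 'full') return 'full'; // full matrix always wins
        level = 'targeted';
      }
    }
  }
  return level;
}

console.log(simulationLevel(['tools/refund.ts', 'ui/button.tsx'])); // → targeted
```

The escalation rule matters: one prompt file in an otherwise UI-only PR should still trigger the full matrix, because prompts affect everything.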
What does a production-ready digital twin framework look like?
A complete framework ties together persona generation, scenario orchestration, scoring, regression detection, and reporting into a single run command. Here's the orchestration layer that connects all the pieces we've built:
interface DigitalTwinConfig {
agentEndpoint: string;
scenarios: SimulationScenario[];
personas: PersonaProfile[];
scorecard: SimulationScorecard;
baseline?: SimulationBaseline;
concurrency: number;
maxTurnsPerConversation: number;
regressionThresholds: {
failDelta: number;
warnDelta: number;
minPassRate: number;
};
}
interface DigitalTwinReport {
summary: {
totalSimulations: number;
passRate: number;
overallMean: number;
duration: number;
};
regression: RegressionReport | null;
byCategory: Record<string, { mean: number; passRate: number; count: number }>;
byPersonaType: Record<string, { mean: number; passRate: number; count: number }>;
failures: Array<{
simulationId: string;
scenarioName: string;
personaName: string;
score: number;
failedChecks: string[];
notes: string;
}>;
}
async function runDigitalTwin(config: DigitalTwinConfig): Promise<DigitalTwinReport> {
const startTime = Date.now();
// 1. Run all simulations
console.log('Starting simulation suite...');
const results = await runSimulationSuite(
config.scenarios,
config.personas,
config.agentEndpoint,
{
concurrency: config.concurrency,
maxTurns: config.maxTurnsPerConversation,
turnTimeout: 30_000,
}
);
// 2. Score all conversations (3x each for reliability)
console.log(`Scoring ${results.length} conversations...`);
const scores = new Map<string, ScorecardResult>();
for (let i = 0; i < results.length; i += config.concurrency) {
const batch = results.slice(i, i + config.concurrency);
const batchScores = await Promise.all(
batch.map(async (result) => {
const scenario = config.scenarios.find(s => s.id === result.scenarioId)!;
// Run scoring 3 times, take median
const runs = await Promise.all(
Array.from({ length: 3 }, () =>
scoreSimulation(result, scenario, config.scorecard)
)
);
const sorted = runs.sort((a, b) => a.weightedAverage - b.weightedAverage);
return { id: result.id, score: sorted[1] }; // median
})
);
for (const { id, score } of batchScores) {
scores.set(id, score);
}
}
// 3. Run programmatic checks
const failures: DigitalTwinReport['failures'] = [];
for (const result of results) {
const score = scores.get(result.id);
const scenario = config.scenarios.find(s => s.id === result.scenarioId)!;
const persona = config.personas.find(p => p.id === result.personaId)!;
const failedChecks = config.scorecard.programmaticChecks
.map(check => check.check(result, scenario))
.filter(r => !r.passed)
.map(r => r.detail);
if (!score?.passed || failedChecks.length > 0) {
failures.push({
simulationId: result.id,
scenarioName: scenario.name,
personaName: persona.name,
score: score?.weightedAverage ?? 0,
failedChecks,
notes: score?.notes ?? '',
});
}
}
// 4. Regression detection
const currentBaseline = buildBaseline(
results, scores, config.scenarios, config.personas, 'current'
);
const regression = config.baseline
? detectSimulationRegressions(config.baseline, currentBaseline, config.regressionThresholds)
: null;
// 5. Build report
const allScores = [...scores.values()].map(s => s.weightedAverage);
return {
summary: {
totalSimulations: results.length,
passRate: allScores.filter(s => s >= config.scorecard.passingThreshold).length / allScores.length,
overallMean: mean(allScores),
duration: Date.now() - startTime,
},
regression,
byCategory: currentBaseline.aggregates.byCategory,
byPersonaType: currentBaseline.aggregates.byPersona,
failures: failures.sort((a, b) => a.score - b.score),
};
}
This is the full loop. A developer changes a prompt. CI triggers. The digital twin spins up hundreds of synthetic customers. Each conversation runs against the real agent configuration. The scoring pipeline grades every exchange. The regression detector compares against last week's baseline. The PR gets a green check, yellow warning, or red block — with a detailed breakdown of which persona types and scenario categories degraded.
What does the implementation roadmap look like?
You don't need to build the full framework on day one. Here's a progression that matches investment to maturity:
Week 1-2: Manual simulation. Write 10-15 persona profiles by hand. Define 5 core scenarios. Run simulations manually using the OpenAI playground or a simple script. Score conversations by reading them. This alone will surface failures you haven't seen.
Week 3-4: Automated scoring. Build the scoring pipeline — LLM judge with weighted rubrics plus programmatic checks. Now you can run personas against your agent and get quantified results instead of subjective impressions.
Month 2: Persona generation and scale. Implement the persona matrix generator. Scale from 15 hand-crafted personas to 50+ generated ones. Add pivot point injections to scenarios. You're now testing systematically against customer diversity.
Month 3: CI/CD integration and regression tracking. Wire simulations into your PR pipeline. Build baselines. Detect regressions automatically. At this point, no prompt change ships without simulation coverage.
Ongoing: Production feedback loop. Mine production analytics for new scenario types and persona behaviors your simulation doesn't yet cover. Use conversation monitoring data to keep your simulation realistic as customer behavior evolves.
The teams getting the most value from digital twins aren't the ones with the most sophisticated frameworks. They're the ones who started simple and iterated. Five good personas and a scoring rubric beat an unused thousand-persona system every time.
Key takeaways
Digital twin testing isn't magic. It's disciplined simulation — building the customers your agent will face, running conversations at scale, measuring quality with rubrics, and catching regressions before they reach production. The tooling matters less than the methodology.
Here's what separates teams that catch failures pre-production from teams that discover them in customer complaints:
- Personas mapped to failure categories. Every persona probes a specific failure mode. Random personas generate random conversations. Targeted personas generate actionable data.
- Scoring with ground truth. Your LLM judge needs the right answers, not just a rubric. Without ground truth, the judge can't distinguish confident hallucination from accurate responses.
- Regression tracking over time. A single simulation run is a snapshot. Regression tracking across runs is what tells you whether your agent is getting better or worse — and where.
- CI/CD integration. If simulation results don't block deploys, they don't matter. The quality gate is what turns testing from a nice-to-have into a system guarantee.
- Production feedback loop. Your simulation is only as good as its scenarios and personas. Continuously update both from real conversation data.
For the scoring methodology deep-dive — LLM-as-judge calibration, rubric anchoring, multi-criteria evaluation — see How to Evaluate AI Agents. For the testing workflow that wraps around scoring — scenario design, edge case generation, regression suites — see AI Agent Testing: How to Evaluate Agents Before Production. For monitoring your agent after it passes the digital twin gate, the patterns in AI Agent Observability connect directly to the metrics we've tracked here.
If building the simulation infrastructure from scratch isn't where you want to spend your engineering time, Chanl's scenario testing and scorecard evaluation handle the orchestration — persona management, parallel simulation, automated scoring, and regression tracking out of the box.
Build the twin. Test at scale. Ship with confidence.