A team at a mid-sized insurance company spent six months building their AI agent. The demo was flawless. Internal testing passed. They launched to real customers.
Within three days, they had a problem: the agent was confidently quoting a refund policy that had been updated two months earlier. Customers were being told they could return products they couldn't actually return. The AI had learned the old policy during training, and nobody had tested whether it would handle the updated version correctly.
The agent passed every test they ran. But the tests didn't cover the scenario that mattered.
This is the production reality gap, and it's where most AI agent projects actually fail. Not in the demo. Not in QA. But in the first week of real traffic.
The production reality gap
Development testing and production testing are fundamentally different problems. Development environments use controlled inputs, clean data, and scenarios your team thought to write. Production brings real customers with typos, unusual accents, emotional states, background noise, and requests that were never in your training data.
Here's what development testing misses:
- The Impatient Customer who interrupts the agent mid-sentence to change their request
- The Confused User who says "uh, actually never mind, can you just..." mid-conversation
- The Policy Edge Case where the customer's situation is technically covered but practically ambiguous
- The 3am Caller who's frustrated before they even start talking
Building a testing framework that catches these gaps before launch isn't optional. It's the difference between a successful deployment and an expensive rollback.
Layer 1: Unit testing the AI components themselves
Unit testing for AI agents is more nuanced than for traditional software. With a REST API, the same input always produces the same output. With an LLM-based agent, you're testing that outputs fall within acceptable ranges, not that they're byte-for-byte identical.
What to test at the unit level:
Intent recognition accuracy
For each intent your agent is designed to handle, test it with at least 15-20 varied phrasings. "I want to cancel my subscription" and "how do I stop being charged?" are the same intent. Your agent needs to recognize both.
Track your recognition accuracy as a percentage. Anything below 90% for a core intent is a problem worth fixing before you ship.
```javascript
// Example: testing intent recognition across variations
const cancellationPhrases = [
  "I want to cancel my subscription",
  "how do I stop being charged?",
  "I need to end my account",
  "cancel everything",
  "I don't want this anymore",
  "can you stop my payments",
  "how do I unsubscribe",
  // ... 10+ more
];

for (const phrase of cancellationPhrases) {
  const result = await agent.classify(phrase);
  expect(result.intent).toBe('cancel_subscription');
  expect(result.confidence).toBeGreaterThan(0.85);
}
```
Entity extraction accuracy
If your agent needs to extract values from user input (account numbers, dates, amounts, names), test extraction precision and recall across messy real-world inputs.
"My account number is um, 7-7-4-2... actually it's 7742-9981" is the kind of input your unit tests should handle, not just "my account number is 77429981."
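A minimal sketch of what handling that messy input might look like. `extractAccountNumber` is a hypothetical helper, not part of any real SDK; it assumes account numbers are 6+ digits and that self-corrections follow markers like "actually":

```javascript
// Hypothetical extractor: when a caller corrects themselves, keep only the
// text after the last correction marker before stripping non-digits.
function extractAccountNumber(utterance) {
  const markers = /\b(actually|no wait|i mean|sorry)\b/gi;
  let lastIndex = -1;
  let match;
  while ((match = markers.exec(utterance)) !== null) {
    lastIndex = match.index + match[0].length;
  }
  const candidate = lastIndex !== -1 ? utterance.slice(lastIndex) : utterance;
  const digits = candidate.replace(/\D/g, "");
  return digits.length >= 6 ? digits : null; // assumption: accounts are 6+ digits
}

console.log(extractAccountNumber("my account number is 77429981"));
// "77429981"
console.log(
  extractAccountNumber("My account number is um, 7-7-4-2... actually it's 7742-9981")
);
// "77429981"
```

The point of the unit test isn't the heuristic itself; it's that both the clean and the messy phrasing must resolve to the same entity.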
Response quality validation
Don't just test that the agent responds. Test that it responds correctly. Check for:
- Factual accuracy against your knowledge base
- Appropriate tone for the scenario (empathetic for complaints, efficient for simple lookups)
- Absence of off-policy content (things the agent should never say)
- Response length within expected ranges
Automated scoring using an LLM-as-judge pattern works well here: have a separate model evaluate your agent's responses against a rubric. Chanl's AI scorecards automate this evaluation at scale.
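One possible shape for that judge, sketched with an injected `judgeModel` stand-in so the scoring logic itself is testable (the rubric, field names, and threshold here are illustrative assumptions, not a Chanl API):

```javascript
// LLM-as-judge sketch: a separate model scores the agent's response
// against a rubric and returns structured JSON.
const RUBRIC = `Score the agent response 1-5 on each axis. Reply as JSON:
{"factual": n, "tone": n, "policy": n}
- factual: claims match the provided knowledge base excerpt
- tone: appropriate for the scenario (empathetic for complaints)
- policy: contains nothing the agent is forbidden to say`;

async function scoreResponse(judgeModel, { response, kbExcerpt, scenario }) {
  const verdict = await judgeModel(
    `${RUBRIC}\n\nScenario: ${scenario}\nKnowledge base: ${kbExcerpt}\nAgent response: ${response}`
  );
  const scores = JSON.parse(verdict);
  // Gate on the weakest axis, not the average: one policy violation fails.
  const pass = Math.min(scores.factual, scores.tone, scores.policy) >= 4;
  return { ...scores, pass };
}

// Usage with a stubbed judge (a real run would call your LLM provider):
const stubJudge = async () => '{"factual": 5, "tone": 4, "policy": 5}';
scoreResponse(stubJudge, {
  response: "I'm sorry about the double charge. I've issued a refund.",
  kbExcerpt: "Refunds for duplicate charges are issued immediately.",
  scenario: "billing complaint",
}).then((r) => console.log(r.pass)); // true
```

Gating on the minimum score rather than the mean is deliberate: a response that is perfectly factual but says something off-policy should still fail.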
Layer 2: Integration testing the full system
Unit tests pass in isolation. Integration tests verify the full system: your agent talking to your actual APIs, knowledge bases, CRM, payment processors, and escalation paths.
Every integration point is a potential failure mode.
Integration test checklist:
End-to-end conversation flows
Map out your 10 most common customer journeys and write scripted integration tests for each. A billing inquiry that resolves successfully. A returns request that escalates to human. A password reset that completes end-to-end.
These tests should run against real (or near-real) infrastructure, not mocks. Mocks can hide integration failures that only surface in production.
Third-party dependency testing
For each external service your agent calls, test both the happy path and failure modes:
| Integration | Happy Path | Failure Mode |
|---|---|---|
| CRM lookup | Returns customer record | Returns 404 → agent asks for info manually |
| Knowledge base | Returns relevant doc | Empty result → agent says "I'm not sure, let me check" |
| Payment API | Confirms transaction | Timeout → agent says "let me retry that" |
| Escalation | Routes to human | Queue full → agent offers callback |
Silent failures are the worst kind. If your CRM returns an error and the agent just continues confidently providing wrong information, that's a production incident waiting to happen.
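A sketch of the pattern that prevents the silent-failure case, with `crmFetch` as an injected stand-in for your real CRM client (names are illustrative): known misses degrade to a fallback behavior, while unexpected statuses throw instead of letting the agent keep talking.

```javascript
// CRM lookup wrapper that fails loudly instead of silently.
async function lookupCustomer(crmFetch, customerId) {
  const res = await crmFetch(customerId);
  if (res.status === 404) {
    // Known miss: degrade to asking the caller for their details.
    return { found: false, fallback: "ask_customer_for_info" };
  }
  if (res.status !== 200) {
    // Anything else is an integration failure: escalate, never guess.
    throw new Error(`CRM lookup failed with status ${res.status}`);
  }
  return { found: true, record: res.body };
}

// Example: a 404 should degrade, not fabricate data.
lookupCustomer(() => Promise.resolve({ status: 404 }), "cust_42")
  .then((r) => console.log(r.fallback)); // "ask_customer_for_info"
```

Each row of the failure-mode table above becomes one stubbed test: one stub per status code, one assertion per expected behavior.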
Escalation path validation
Every escalation scenario should be explicitly tested. When does the agent hand off to a human? What information does it pass along? How does it communicate the handoff to the customer?
The warm handoff is one of the hardest things to get right. "Let me transfer you" followed by immediate disconnect is a well-documented source of customer rage.
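One way to make the handoff testable is to assert on the payload the agent passes along. This is a sketch with illustrative field names, not a Chanl schema; the idea is that the human should never have to ask the customer to start over:

```javascript
// Warm-handoff payload: the context the agent hands to a human.
function buildHandoffPayload(session) {
  return {
    customerId: session.customerId ?? null,
    intent: session.lastIntent,
    sentiment: session.sentiment,           // e.g. "frustrated"
    summary: session.summary,               // short recap for the human
    transcript: session.turns.slice(-10),   // recent turns only
    attemptedActions: session.toolCalls.map((t) => t.name),
  };
}
```

An escalation test then checks that every field is populated before the transfer fires, and that the customer hears an explicit handoff message rather than silence.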
Layer 3: Performance testing under load
A 200ms response time at 10 concurrent sessions can become 2,000ms at 100 concurrent sessions. Performance testing isn't optional. It's how you set capacity limits before you need them.
What to measure:
Latency percentiles
Always test p50, p95, and p99 latency, not just averages. Averages hide outliers. If 1% of your conversations take 8 seconds to respond, that's real customers having a terrible experience.
Set thresholds before testing starts. A reasonable baseline for voice AI: p50 < 400ms, p95 < 800ms, p99 < 1,500ms end-to-end response time. Adjust based on your actual customer expectations.
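Computing those percentiles from raw load-test samples is a few lines; this sketch uses the nearest-rank method and the baseline budgets above:

```javascript
// Latency percentile via nearest-rank on a sorted sample.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Check a run of measured latencies against the voice AI baseline budgets.
function checkLatencyBudget(latenciesMs) {
  const p50 = percentile(latenciesMs, 50);
  const p95 = percentile(latenciesMs, 95);
  const p99 = percentile(latenciesMs, 99);
  return { p50, p95, p99, withinBudget: p50 < 400 && p95 < 800 && p99 < 1500 };
}
```

Feed it every per-response latency from a run, not a pre-averaged number, or the outliers you care about disappear before you measure them.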
Concurrent session capacity
Run load tests that ramp from 1 to your expected peak concurrency gradually. Find where latency degrades, where errors start appearing, and where the system falls over. That gives you your safe operating limit.
Document this number and build alerting around 70% of capacity, not 100%.
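Turning ramp-test output into that documented limit can be mechanical. A sketch, assuming your harness records p95 latency and error rate at each concurrency step (the thresholds here are the baseline budgets from above):

```javascript
// Given ramp results in increasing concurrency order, find the last healthy
// load level and derive the 70% alerting threshold.
function safeOperatingLimit(rampResults, { p95BudgetMs = 800, maxErrorRate = 0.01 } = {}) {
  let limit = 0;
  for (const r of rampResults) {
    if (r.p95Ms <= p95BudgetMs && r.errorRate <= maxErrorRate) {
      limit = r.concurrency;
    } else {
      break; // first unhealthy level ends the safe range
    }
  }
  return { limit, alertThreshold: Math.floor(limit * 0.7) };
}

console.log(safeOperatingLimit([
  { concurrency: 10, p95Ms: 300, errorRate: 0 },
  { concurrency: 50, p95Ms: 600, errorRate: 0 },
  { concurrency: 100, p95Ms: 1200, errorRate: 0.05 },
]));
// { limit: 50, alertThreshold: 35 }
```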
Degradation patterns
The question isn't just "how many sessions can the system handle?" It's "how does quality change as load increases?" Does the model start making more mistakes? Does response coherence drop? Does intent recognition accuracy fall?
If your agent gets dumber under load, you need to know that before your users do.
Layer 4: Chaos testing, deliberately breaking things
Chaos testing is where most AI testing frameworks stop short. It's also where the most expensive production failures come from.
The premise is simple: if you don't test what happens when things go wrong, you'll find out when they actually go wrong. In production, in front of real customers.
Chaos scenarios to test:
Dependency failures
What happens when:
- The knowledge base is returning 503s?
- The CRM lookup takes 30 seconds instead of 300ms?
- The LLM provider's API goes down mid-conversation?
- The escalation queue is full?
For each scenario, verify that the agent either recovers gracefully or fails gracefully, not silently or catastrophically.
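"Fails gracefully" for the slow-dependency case usually means a deadline. A sketch of a chaos-test helper that races a dependency call against a timeout, so a hung knowledge-base lookup degrades into an honest fallback instead of a frozen conversation (the helper and fallback text are illustrative):

```javascript
// Race a dependency call against a deadline; on timeout, return a fallback.
function withTimeout(promise, ms, fallbackValue) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ timedOut: true, value: fallbackValue }), ms);
  });
  return Promise.race([
    promise.then((value) => ({ timedOut: false, value })),
    deadline,
  ]).finally(() => clearTimeout(timer));
}

// Simulate the "lookup takes 30s instead of 300ms" scenario at small scale:
const slowLookup = new Promise((resolve) => setTimeout(() => resolve("doc"), 200));
withTimeout(slowLookup, 50, "I'm not sure, let me check.")
  .then((r) => console.log(r.timedOut)); // true
```

The chaos test then asserts on both branches: the fast path returns the real value, and the slow path returns the fallback before the customer is left in silence.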
Network instability
Voice AI is especially vulnerable to network issues. Test packet loss, increased jitter, and connection drops. What does the agent do when it can't hear the customer clearly? What happens if a tool call hangs for 10 seconds?
Data edge cases
What happens when:
- A customer's account number doesn't exist in the system?
- A product they're asking about was discontinued last week?
- Their name contains characters the system wasn't built to handle?
- They're calling about an order that belongs to a different account?
These aren't hypothetical. They're exactly what real customers do.
Testing personas: the secret weapon
The best way to find failure modes your scripted tests miss is to simulate users who don't behave the way you'd expect.
At Chanl, we've found four personas that expose the most critical gaps:
The Impatient Customer
Interrupts constantly. Asks follow-up questions before the agent finishes its response. Has a problem and wants it solved in 30 seconds. This persona tests your interruption handling and your agent's ability to pick up context after being cut off.
The Confused User
Changes topics mid-sentence. Provides partial information and gets frustrated when asked for more. Says "you know, the thing I ordered last week, no wait, actually it was two weeks ago" and expects the agent to keep up. This tests context tracking and graceful disambiguation.
The Edge Case Explorer
Asks about policies for unusual situations. "Can I return a custom order?" "What if I missed the return window by one day?" "What does your guarantee actually cover?" These conversations expose gaps between what your training data covers and what customers actually ask.
The Frustrated Escalator
Arrives already annoyed. Uses emotionally charged language. Escalates to "I want to talk to a human" within the first exchange. This tests your escalation triggers, your de-escalation attempts, and the quality of the handoff when you don't de-escalate successfully.
Run each persona against each of your major conversation flows. The combinations will surface failure modes your deterministic tests never will.
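In practice that cross-product can be generated rather than hand-written. A sketch, where the persona `style` strings would prompt a simulator model in a real run (flow names here are examples, not a fixed list):

```javascript
// Cross personas with conversation flows to enumerate simulated test runs.
const personas = [
  { name: "impatient", style: "interrupts, demands resolution in 30 seconds" },
  { name: "confused", style: "changes topic mid-sentence, gives partial info" },
  { name: "edge_case_explorer", style: "probes unusual policy situations" },
  { name: "frustrated_escalator", style: "arrives angry, demands a human early" },
];
const flows = ["billing_inquiry", "returns_request", "password_reset"];

function buildPersonaMatrix(personas, flows) {
  return personas.flatMap((p) => flows.map((f) => ({ persona: p.name, flow: f })));
}

console.log(buildPersonaMatrix(personas, flows).length); // 12
```

Four personas against three flows is already twelve distinct simulated conversations; each one is a chance to surface a failure mode no scripted test encodes.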
Building your testing pipeline
You don't have to implement all four layers at once. Here's a practical sequence:
1. Start with unit tests for your highest-traffic intents. If 40% of your calls are billing questions, your billing intent recognition should have comprehensive unit coverage first.
2. Add integration tests for your most critical flows. Usually: successful resolution, escalation to human, and account lookup failure handling.
3. Run performance tests before any major launch. Not just "does it work," but "how does it behave at 10x our expected peak?"
4. Add chaos testing incrementally. Pick your three most likely failure modes and write chaos tests for those first. Expand the library over time.
Regression testing on every merge
Every time you change your agent (prompt update, new tool, model version bump), you're introducing regression risk. The same conversation that worked last week might produce a subtly different (worse) answer today.
Automated regression suites that run on every merge catch these before they reach production. They don't need to be exhaustive. 50-100 representative conversations covering your major flows is often enough to catch significant regressions.
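A minimal sketch of such a gate, assuming you keep golden conversations with required phrases (`runAgent` and the case shape are stand-ins for your own suite; a fuller version might use semantic similarity instead of substring checks):

```javascript
// Replay stored conversations against the current build; fail the merge
// if more answers drift than the allowed budget.
async function regressionGate(runAgent, goldenCases, { maxFailures = 0 } = {}) {
  const failures = [];
  for (const c of goldenCases) {
    const reply = await runAgent(c.input);
    // Cheap drift check: every required phrase must still appear.
    const ok = c.mustContain.every((phrase) =>
      reply.toLowerCase().includes(phrase.toLowerCase())
    );
    if (!ok) failures.push({ input: c.input, got: reply });
  }
  return { passed: failures.length <= maxFailures, failures };
}
```

Wired into CI, this runs on every prompt change, tool addition, or model version bump, and the `failures` list tells you exactly which conversations drifted.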
What "production-ready" actually means
A useful heuristic: your agent is production-ready when you can answer yes to all of these:
- Does it handle your 10 most common conversations correctly, consistently, across varied phrasings?
- Does it escalate appropriately when it should, and not escalate when it shouldn't?
- Has it been tested at 2x your expected peak load without latency degradation?
- Does it fail gracefully when any single dependency is unavailable?
- Do you have monitoring in place to catch quality regressions you didn't anticipate?
If any of these is "not yet," that's your testing roadmap.
The goal isn't a perfect agent. It's an agent that fails predictably and recovers gracefully, with a monitoring system that tells you when it's struggling before your customers do.
Getting started this week
If you're shipping an AI agent in the next 30 days and haven't built a systematic testing framework yet, here's the minimum viable version:
- Write 10 scripted conversation tests covering your most critical flows, happy paths and the most obvious failure modes
- Pick your 3 most important intents and test each with 15+ varied phrasings
- Run a basic load test at 2x your expected peak concurrency
- Test your single most critical dependency failure, usually the CRM or knowledge base going down
That won't catch everything. But it'll catch the obvious things, and it gives you a foundation to build on. The testing frameworks that work in production are the ones that start simple and grow with the system, not the ones that try to test everything before a single line ships.
Start there, and iterate.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.