A team upgrades their customer service agent to the latest model. Benchmarks improve 15%. Reasoning scores jump. The model handles longer conversations, catches more nuance, and the eval suite lights up green.
Customer complaints increase 23% in the first week.
Not because the new model is worse. It's measurably better at almost everything. It just isn't measurably better at doing the same thing the same way twice. Monday's answer to "can I get a refund?" is different from Thursday's. Not wrong, exactly. Just different enough that the operations team can't predict what customers will hear.
This is the reliability-capability gap. And the research says it's getting wider, not narrower.
Table of contents
- The gap the benchmarks hide
- The compounding problem
- Beyond accuracy: the CLEAR framework
- pass@k vs pass^k: what consistency actually means
- Four dimensions of reliability
- Measuring reliability in practice
- Building a reliability testing workflow
- Where this is heading
The gap the benchmarks hide
Reliability improves at half the rate of accuracy. On customer service tasks specifically, it improves at one-seventh the rate. That's the finding from Princeton researchers Sayash Kapoor and Arvind Narayanan, published in March 2026, and it reframes the entire conversation about model upgrades.
Their analysis traced how frontier models have evolved over the past two years. On standard benchmarks, accuracy has climbed steadily. Models solve harder problems, handle longer contexts, and demonstrate more sophisticated reasoning with each generation. But when the researchers measured whether those same models give the same correct answer when asked the same question multiple times, the improvement curve flattened.
The numbers are stark. Claude Opus 4.5, one of the strongest models available, achieves only 73% consistency across repeated runs. Gemini 3 Pro reaches just 52% accuracy when judging its own outputs, and manages to avoid catastrophic errors only 25% of the time. These aren't cherry-picked failure cases. They're measurements of the best models we have, running on controlled tasks, under ideal conditions.
Paper: "AI's Reliability Crisis" (Kapoor & Narayanan, Princeton) Reliability improves at half the rate of accuracy on general benchmarks, and at one-seventh the rate on customer service tasks. Read the coverage (Fortune, March 2026) ->
The implication for production teams is uncomfortable. Every model upgrade that improves accuracy by, say, 10 points might improve reliability by only 5. For customer-facing agents, the ratio is worse: roughly 1.4 points of reliability for every 10 points of accuracy. You're making your agent smarter while barely making it more dependable.
This matters because users don't experience accuracy. They experience consistency. A customer who gets a correct-but-different answer each time they call doesn't think "the model is accurate." They think "I can't trust this system."
The compounding problem
The reliability gap gets worse at scale because real systems chain multiple components together, and probabilities compound. This is where the math turns unforgiving.
Kapoor and Narayanan's research included a case study from medical AI. Three diagnostic tools, each individually impressive: 90% accuracy, 85% accuracy, and 97% accuracy. Used alone, any of them would seem reliable. Used together in a pipeline where all three need to be correct, the combined reliability was 74%.
The multiplication is simple:
```
0.90 x 0.85 x 0.97 = 0.74
```

That's a 26% failure rate from a system where every individual component works at least 85% of the time. And real agent pipelines have more than three components.
Consider a typical production agent handling a customer request. The query hits a router (98% accuracy). The router selects the right tool (92% accuracy). The tool executes correctly (95% accuracy). The agent interprets the result (90% accuracy). The agent formulates a response (93% accuracy). The response passes safety filters (99% accuracy).
Each step seems fine. Multiplied together:
```
0.98 x 0.92 x 0.95 x 0.90 x 0.93 x 0.99 ≈ 0.71
```

A roughly 30% end-to-end failure rate from components that are each individually above 90%. This isn't a theoretical exercise. It's the math behind why teams see 85% accuracy on unit tests and 65% success rates in production.

The gap between component-level metrics and system-level reliability is the first thing teams discover when they start measuring end-to-end. And the more capable your agent is (more tools, more steps, more reasoning chains), the more components are in the pipeline, and the worse the compounding gets.
This creates a paradox: adding capabilities can reduce reliability. Give your agent ten more tools and its per-step accuracy might hold steady, but the probability that it chains all steps correctly drops with every additional link.
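The compounding is a one-liner to check. A quick sketch, using the step accuracies from the example pipeline above:

```typescript
// End-to-end reliability of a chained pipeline is the product of per-step accuracies.
function pipelineReliability(stepAccuracies: number[]): number {
  return stepAccuracies.reduce((product, p) => product * p, 1);
}

// The six-step agent pipeline from the example above
const steps = [0.98, 0.92, 0.95, 0.9, 0.93, 0.99];
console.log(pipelineReliability(steps).toFixed(2)); // "0.71"

// Two more tools at 95% per-step accuracy drag it down further
console.log(pipelineReliability([...steps, 0.95, 0.95]).toFixed(2)); // "0.64"
```

Every added link multiplies in another factor below 1, which is why per-step metrics alone never predict end-to-end success.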
Beyond accuracy: the CLEAR framework
Accuracy alone can't predict whether an agent will succeed in production. A research team proved this empirically with 300 enterprise tasks and six commercial agents.
The CLEAR framework, published in November 2025, evaluates agents across five dimensions: Cost, Latency, Efficacy, Assurance, and Reliability. The researchers found that accuracy-only evaluation correlated with production success at 0.41. CLEAR's multi-dimensional score correlated at 0.83. That's twice the predictive power.
Paper: "Beyond Accuracy: Multi-Dimensional Framework for Enterprise Agentic AI" (CLEAR) Accuracy-only evaluation correlates with production success at 0.41. Multi-dimensional evaluation (Cost, Latency, Efficacy, Assurance, Reliability) reaches 0.83. Read the paper ->
Three findings stand out:
Cost varies 50x at similar accuracy. Two agents that achieve the same accuracy on a task can cost $0.02 and $1.00 respectively, depending on how many inference calls they make, how much context they load, and how many retry loops they enter. Cost-aware agents were 4.4 to 10.8x cheaper than cost-ignorant alternatives while maintaining equivalent quality.
Repeated runs expose the real number. An agent that scores 60% on a single evaluation run shows much lower effective reliability across 8 runs. The single-run number is what most evaluations report. The multi-run number is what production requires. (We'll cover the exact math in the pass^k section below.)
Expert evaluation confirms the gap. When the researchers asked domain experts to rate agent outputs, the expert rankings matched CLEAR's multi-dimensional scores closely (0.83 correlation) but poorly matched accuracy-only scores (0.41). The experts noticed cost overruns, high latency, low confidence, and inconsistency. Accuracy-only metrics missed all of it.
This has practical implications for how you evaluate agents. If you're choosing between two models or two prompt configurations, accuracy alone won't tell you which one will perform better in production. You need to measure cost per interaction, response time, output quality, confidence calibration, and consistency. Any one dimension can sink an otherwise capable agent.
The multi-criteria approach maps directly to how production scorecards work. Instead of a single pass/fail, you score each quality dimension independently. An agent might nail accuracy but fail on tone consistency, or handle 95% of requests cheaply but blow the budget on the other 5%.
pass@k vs pass^k: what consistency actually means
This is the single most important concept in agent reliability, and it's one that most teams get wrong.
pass@k measures the probability that an agent succeeds at least once in k attempts. If your agent has 60% accuracy, pass@3 is about 94%. Run it three times, and you'll almost certainly get one good answer. This metric is useful for developer tools, code generation, and any context where you can cherry-pick the best output.
pass^k measures the probability that an agent succeeds on every attempt across k trials. That same 60% agent has a pass^3 of roughly 22%. It needs to be right three times in a row, and the odds are stacked against it.
Anthropic's engineering team highlighted this distinction in their "Demystifying Evals" guide. At scale, these two metrics diverge dramatically:
| Single-run accuracy | pass@8 (at least 1 success) | pass^8 (all 8 succeed) |
|---|---|---|
| 90% | >99.99% | 43% |
| 80% | >99.99% | 17% |
| 70% | 99.99% | 6% |
| 60% | 99.93% | 2% |
Reference: "Demystifying Evals" (Anthropic Engineering) pass@k (probability of at least one success in k attempts) vs pass^k (probability of success across ALL k trials) diverge dramatically at scale. The metric you choose determines whether you're measuring potential or production reliability. Read the guide ->
Look at the 80% row. An agent at 80% accuracy almost never fails to produce at least one correct answer in 8 tries. But it only succeeds on all 8 attempts 17% of the time. If that agent handles customer service requests, one in five customers experiences a failure on any given interaction. Even at 90% accuracy, more than half of 8-run sequences contain at least one error.
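Both metrics have closed forms for a single-run success probability p: pass@k = 1 − (1 − p)^k, and pass^k = p^k. A few lines reproduce the table:

```typescript
// pass@k: probability of at least one success in k independent attempts
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that all k independent attempts succeed
function passPowerK(p: number, k: number): number {
  return Math.pow(p, k);
}

// The 80% row: near-certain to succeed at least once, ~17% to succeed all 8 times
console.log(passAtK(0.8, 8));    // ≈ 0.999997
console.log(passPowerK(0.8, 8)); // ≈ 0.1678
```

The divergence grows with k: the longer the run of interactions you need to get right, the further pass^k falls below the single-run number.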
For customer-facing agents, pass^k is what matters. A customer doesn't get to run their refund request eight times and pick the best response. They get one shot, and you need that shot to be reliable. But reliability across all interactions is pass^k across your entire request volume.
This is where scenario testing becomes critical. Running a scenario once and checking if it passed is pass@1. Running the same scenario k times and checking if it passed every time is pass^k. The difference between these two protocols is the difference between "does our agent work?" and "can we trust our agent?"
Here's what measuring pass^k looks like in practice with the Chanl SDK. Run the same batch of scenarios multiple times and compare consistency:
```typescript
import { Chanl } from "@chanl/sdk";

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Run all scenarios for an agent k times
async function measurePassK(agentId: string, k: number) {
  const runs = [];
  for (let i = 0; i < k; i++) {
    const result = await chanl.scenarios.runAll({
      agentId,
      minScore: 80,
      parallel: 3,
    });
    runs.push(result);
  }

  // pass@k: did at least one run have all scenarios pass?
  const passAtK = runs.some((r) => r.allPassed);

  // pass^k: did ALL runs have all scenarios pass?
  const passPowerK = runs.every((r) => r.allPassed);

  // Score consistency: how much does averageScore vary?
  const scores = runs.map((r) => r.averageScore);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;

  return {
    passAtK,
    passPowerK,
    meanScore: mean.toFixed(1),
    scoreVariance: variance.toFixed(2),
    runs: runs.map((r) => ({
      passed: r.passed,
      failed: r.failed,
      averageScore: r.averageScore,
    })),
  };
}

const report = await measurePassK("agent_abc", 5);
console.log(`pass@5: ${report.passAtK}`);
console.log(`pass^5: ${report.passPowerK}`);
console.log(`Score variance: ${report.scoreVariance}`);
```

A score variance above 10% on repeated runs signals a consistency problem. The agent is capable (pass@k is high) but unreliable (pass^k is low). When you see that pattern, the fix isn't a better model. It's tighter constraints, more specific instructions, or guardrails that limit the range of acceptable outputs.
Four dimensions of reliability
If reliability were a single axis, improving it would be straightforward. Researchers found that it's actually four separate dimensions, each capable of failing independently.
A February 2026 paper titled "Towards a Science of AI Agent Reliability" proposed a formal framework with four dimensions and twelve sub-metrics. What makes this research valuable isn't just the taxonomy. It's the finding that reliability doesn't improve uniformly with capability. A model upgrade might improve consistency while degrading robustness. Or it might improve safety while making predictability worse.
Paper: "Towards a Science of AI Agent Reliability" Proposes four reliability dimensions (Consistency, Robustness, Predictability, Safety) with 12 sub-metrics. Finds that reliability does not improve uniformly with capability, and models show brittleness to prompt paraphrasing. Read the paper ->
Consistency: same input, same output
Does the agent produce similar results for similar inputs? Not identical (that would require determinism), but within an acceptable range. An agent that quotes a 30-day return window on Monday and a 14-day window on Thursday has a consistency problem, even if the 30-day answer is correct.
Consistency is what most teams think of when they hear "reliability." It's also the easiest dimension to test: run the same input multiple times and measure variance.
The research found that models are particularly brittle to prompt paraphrasing. Asking "What's your return policy?" and "Can you tell me about returns?" should produce substantively identical answers. Many agents give qualitatively different responses, sometimes including different policy details, because the slight input variation activates different attention patterns.
Robustness: graceful degradation
How does the agent behave when inputs are unexpected, malformed, or adversarial? A capable agent that crashes on typos, incomplete sentences, or unexpected languages isn't reliable, regardless of how well it handles clean inputs.
Robustness failures are the most common source of production incidents. The user who types "refudn" instead of "refund." The customer who pastes an entire email thread into the chat. The voice caller whose accent triggers unexpected transcription. A reliable agent handles these gracefully. An unreliable one hallucinates, loops, or gives up.
Predictability: calibrated confidence
Does the agent know when it's uncertain? An agent that confidently states wrong answers is more dangerous than one that says "I'm not sure." Predictability means the agent's expressed confidence matches its actual accuracy.
This is where Gemini 3 Pro's 52% self-judging accuracy becomes alarming. The model can't tell the difference between its good outputs and its bad outputs roughly half the time. If your agent can't tell when it's guessing, it can't escalate appropriately, and your customers have no way to know which answers to trust.
Safety: constraint compliance
Does the agent respect its boundaries even when doing so conflicts with task completion? The most dramatic example from the Princeton research: Replit's AI coding assistant deleted a user's production database in July 2025. The agent had the capability to perform complex database operations. It lacked the reliability to respect the constraint "never perform destructive operations without explicit confirmation."
Safety failures scale with capability. A weak agent that can't access your database can't delete it either. A capable agent with database tools can do extraordinary things and catastrophic things, and the difference often comes down to a single decision about constraint compliance.
This is why tool management matters for reliability, not just capability. The more tools your agent has access to, the more potential safety boundaries it needs to respect. Each tool is a dimension where safety compliance can fail.
Measuring reliability in practice
Most teams measure accuracy and call it a day. Research from a December 2025 survey of 306 practitioners confirms this gap: 74% still depend primarily on human evaluation, and they rate reliability as their number one challenge.
Paper: "Measuring Agents in Production" Survey of 306 practitioners found that 74% depend on human evaluation. Reliability is the number one challenge reported by production AI teams. Read the paper ->
The problem isn't awareness. Most teams know they should test reliability. The problem is that reliability testing requires different protocols than accuracy testing. Here are the three protocols that the research points to, ordered by how much they reveal.
Protocol 1: Multi-run consistency
The simplest reliability test. Run the same scenario k times with identical inputs and measure output variance.
What to look for:
- Score variance across runs. If your scorecard evaluations produce scores that swing by more than 10 points across identical runs, your agent has a consistency problem.
- Categorical stability. Does the agent give the same type of answer each time? An agent that sometimes offers a refund and sometimes offers a credit has a more serious consistency issue than one that phrases the same refund offer slightly differently.
- Failure mode concentration. When the agent fails, does it fail the same way? Consistent failure modes are easier to fix than random ones.
Multi-run testing reveals the gap between pass@1 and pass^k that we covered earlier. It's the minimum viable reliability protocol.
Protocol 2: Prompt paraphrasing
The brittleness test. Take each of your core scenarios and create 3-5 paraphrased versions of the user input. Same intent, different phrasing.
- Original: "I want to cancel my subscription."
- Paraphrase 1: "How do I stop my monthly billing?"
- Paraphrase 2: "Cancel please."
- Paraphrase 3: "I need to end my account. The subscription one."
- Paraphrase 4: "hey can u cancel me"
A reliable agent handles all five identically in substance. An unreliable one might trigger the cancellation flow for the first three, give general billing information for the fourth, and misinterpret the fifth entirely.
Prompt paraphrasing tests robustness to input variation, which the reliability research identified as a major brittleness vector. It catches problems that multi-run testing misses, because multi-run uses the same input each time.
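One way to make the paraphrase protocol mechanical is to group variant scores under each intent and flag the intents where paraphrases fall well below the original. A sketch, assuming your evaluation already produces a 0-100 score per input (the 15-point gap is an illustrative threshold):

```typescript
interface ParaphraseResult {
  intent: string;             // e.g. "cancel_subscription"
  originalScore: number;      // score on the canonical phrasing
  paraphraseScores: number[]; // scores on the reworded variants
}

// Flag intents where paraphrases score notably below the original phrasing.
function findBrittleIntents(results: ParaphraseResult[], maxGap = 15): string[] {
  return results
    .filter((r) => {
      const mean =
        r.paraphraseScores.reduce((sum, s) => sum + s, 0) / r.paraphraseScores.length;
      return r.originalScore - mean > maxGap;
    })
    .map((r) => r.intent);
}

const results: ParaphraseResult[] = [
  { intent: "cancel_subscription", originalScore: 92, paraphraseScores: [90, 88, 61, 55] },
  { intent: "refund_request", originalScore: 88, paraphraseScores: [87, 85, 86] },
];

console.log(findBrittleIntents(results)); // ["cancel_subscription"]
```

The flagged intents are the ones whose prompts are anchored to keywords rather than meaning, which is where tighter system instructions pay off first.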
Protocol 3: Parameterized environments
The hardest and most revealing protocol. Change the context around the same request and verify the agent adapts correctly.
Same request ("I want a refund"), different environments:
- Customer has 2 previous orders (simple case)
- Customer has 47 orders across 3 years (context length stress)
- Customer's most recent order was 89 days ago (edge case near the 90-day policy boundary)
- Customer has a history of refund abuse (policy conflict)
- Customer's account is flagged for fraud review (safety constraint)
Each variation tests a different dimension. The 47-order case tests robustness under long context. The 89-day case tests precision at policy boundaries. The fraud-flag case tests safety constraint compliance. If your agent handles the simple case perfectly but breaks on any of these variations, you've found a reliability gap.
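The variations above can be written down as parameterized test cases: one request crossed with a set of context fixtures, each targeting a reliability dimension. A sketch of the shape (field names are illustrative, not a fixed schema):

```typescript
interface EnvironmentVariant {
  name: string;
  dimension: "consistency" | "robustness" | "predictability" | "safety";
  context: Record<string, unknown>; // fixture data injected into the test environment
}

interface ParameterizedScenario {
  request: string;
  variants: EnvironmentVariant[];
}

const refundScenario: ParameterizedScenario = {
  request: "I want a refund",
  variants: [
    { name: "simple", dimension: "consistency", context: { orderCount: 2 } },
    { name: "long-history", dimension: "robustness", context: { orderCount: 47, accountAgeYears: 3 } },
    { name: "policy-boundary", dimension: "predictability", context: { daysSinceOrder: 89, policyWindowDays: 90 } },
    { name: "fraud-flag", dimension: "safety", context: { fraudReviewFlagged: true } },
  ],
};

// Expand into concrete executions: same request, k runs per environment
function expand(scenario: ParameterizedScenario, runsPerVariant: number) {
  return scenario.variants.flatMap((variant) =>
    Array.from({ length: runsPerVariant }, (_, i) => ({
      request: scenario.request,
      variant: variant.name,
      run: i + 1,
    }))
  );
}

console.log(expand(refundScenario, 3).length); // 12
```

Keeping the request fixed while only the context changes is what isolates the environment as the variable under test.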
When you combine all three protocols, you're testing: the same input repeated (consistency), similar inputs with different phrasing (robustness), and the same input under different conditions (predictability and safety). Together, they provide coverage across all four reliability dimensions.
Building a reliability testing workflow
Theory is useful. A workflow you can run before every deploy is better. Here's how the research translates into a testing protocol you can implement this week.
Step 1: Establish your pass^k baseline
Before changing anything, measure where you are. Pick your 10 most important scenarios and run each one 5 times. Record the per-scenario pass rate and score variance.
You're looking for three numbers:
- pass^5 rate: What percentage of scenarios pass on ALL 5 runs?
- Mean score variance: How much do scores fluctuate across runs?
- Failure concentration: Do failures cluster on specific scenarios?
Most teams are surprised by this baseline. An agent that "works" in manual testing often shows a pass^5 below 50%. That's normal. It's also the reason customers report inconsistent experiences.
Step 2: Add paraphrase variants
For each of your 10 core scenarios, write 3 paraphrased inputs. You now have 40 test cases (10 original + 30 paraphrases). Run each one 3 times. That's 120 total executions, which sounds like a lot but completes in minutes with parallel execution.
Compare the scores between originals and paraphrases. If the paraphrased versions score more than 15 points lower on average, your agent is brittle to input variation. The fix is usually tighter system instructions that anchor behavior to intent rather than keywords.
Step 3: Introduce environment variations
Pick your 3 highest-risk scenarios (refunds, cancellations, account changes) and create 3 environment variations each: simple case, edge case, and conflict case. That's 9 more test cases.
Run these with multi-run protocol (3 times each). The edge and conflict cases will almost certainly expose reliability failures that clean-input testing misses.
Step 4: Build the scorecard that catches dimensional failures
A single pass/fail score hides the dimensions where reliability breaks. An agent that scores 85% overall might have 98% accuracy, 90% completeness, 60% tone consistency, and 40% policy adherence. The aggregate looks fine. The policy adherence is a liability.
Multi-criteria scorecards decompose quality into independent dimensions. When you run your reliability protocol, you're not just asking "did the agent pass?" You're asking "did accuracy hold?", "did tone stay consistent?", "did the agent follow policy?", and "did it respect safety boundaries?"
This is where the CLEAR framework's predictive advantage comes from. Single-dimension evaluation misses the dimension that's actually broken. Multi-dimension evaluation pinpoints it.
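A toy decomposition shows how a blended number hides the broken dimension. This sketch uses an unweighted mean for the aggregate and reports any dimension below a floor; the dimension names and the 70-point floor are illustrative (production scorecards often weight dimensions, which is exactly how a poor dimension can hide behind a high blended score):

```typescript
type Scorecard = Record<string, number>; // dimension -> score out of 100

function evaluateScorecard(card: Scorecard, dimensionFloor = 70) {
  const entries = Object.entries(card);
  const aggregate = entries.reduce((sum, [, score]) => sum + score, 0) / entries.length;
  const failingDimensions = entries
    .filter(([, score]) => score < dimensionFloor)
    .map(([dimension]) => dimension);
  return { aggregate, failingDimensions };
}

const card = { accuracy: 98, completeness: 90, toneConsistency: 60, policyAdherence: 40 };
const result = evaluateScorecard(card);

console.log(result.aggregate);         // 72: one number, and the liability is invisible
console.log(result.failingDimensions); // ["toneConsistency", "policyAdherence"]
```

Reporting the failing dimensions alongside the aggregate is what turns "the agent scored lower this week" into "policy adherence regressed."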
Step 5: Monitor reliability continuously
Testing before deploy catches regressions. Monitoring in production catches drift. Models don't degrade suddenly. They degrade gradually as the distribution of real user inputs shifts away from what your test suite covers.
Track these production metrics weekly:
- Score variance across conversations on the same topic
- Escalation rate trends (rising escalations often indicate declining reliability)
- Per-dimension scorecard trends (a dipping dimension is an early warning)
When any metric moves more than 10% from baseline, re-run your reliability protocol. The cause is usually one of three things: a model update from your provider, a prompt change that wasn't adequately tested, or a shift in the types of requests your customers are making.
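The 10%-from-baseline trigger reduces to a simple weekly comparison. A sketch, with metric names as placeholders for whatever your scorecard tracks:

```typescript
interface MetricSnapshot {
  scoreVariance: number;
  escalationRate: number;
  policyAdherence: number;
}

// Return the metrics whose relative change from baseline exceeds the threshold.
function driftedMetrics(
  baseline: MetricSnapshot,
  current: MetricSnapshot,
  threshold = 0.1
): string[] {
  return (Object.keys(baseline) as (keyof MetricSnapshot)[]).filter((key) => {
    const relativeChange = Math.abs(current[key] - baseline[key]) / baseline[key];
    return relativeChange > threshold;
  });
}

const baseline = { scoreVariance: 4.0, escalationRate: 0.08, policyAdherence: 0.95 };
const thisWeek = { scoreVariance: 4.2, escalationRate: 0.11, policyAdherence: 0.94 };

// escalationRate moved ~37% from baseline: time to re-run the reliability protocol
console.log(driftedMetrics(baseline, thisWeek)); // ["escalationRate"]
```

Running this against each week's snapshot turns "reliability feels worse" into a named metric that crossed a named threshold.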
The full protocol in summary
| Protocol | What it tests | Runs per scenario | Catches |
|---|---|---|---|
| Multi-run | Consistency | k identical runs | pass@k vs pass^k gap |
| Paraphrase | Robustness | 3-5 variants x 3 runs | Input brittleness |
| Parameterized | Predictability + Safety | 3 environments x 3 runs | Edge case and constraint failures |
| Multi-criteria | Dimensional | Per-run decomposition | Hidden dimensional failures |
Where this is heading
The reliability-capability gap is the defining challenge of production AI in 2026. The models are getting better. The gap isn't closing at the same rate.
Three trends are shaping what comes next:
Reliability as a first-class metric. The CLEAR framework, the Princeton analysis, and the practitioner survey all point in the same direction: teams that measure only accuracy are operating blind. Expect model providers to start publishing pass^k alongside pass@1 on their benchmarks. The teams that built reliability testing early will have the data advantage when that shift happens.
Cost-reliability tradeoffs becoming explicit. The CLEAR research showed 50x cost variation for equivalent accuracy. As teams adopt multi-dimensional evaluation, they'll discover that reliability often correlates more with architecture choices (guardrails, constraint systems, fallback chains) than with model capability. Throwing a bigger model at a reliability problem rarely fixes it. Adding structure around a smaller model often does.
Dimensional reliability becoming standard. The four-dimension framework (consistency, robustness, predictability, safety) will become the standard way teams talk about and measure agent quality. Single-number reliability scores will be understood as hiding the same problems that single-number accuracy scores hide today.
The teams that are shipping reliable agents today aren't using different models. They're testing differently. They measure pass^k instead of pass@1. They test with paraphrased inputs and edge-case environments. They decompose quality into independent dimensions and track each one separately.
Your agent probably is getting smarter. The question is whether your testing is keeping up with the gap between what it can do and what it does consistently. The research says that gap is real, it's wide, and it's growing. Closing it isn't a model problem. It's a measurement problem.