
74% of Production Agents Still Rely on Human Evaluation

A survey of 306 practitioners reveals most production agents are far simpler than expected. The eval gap isn't a tooling problem. It's a trust problem.

Dean Grover, Co-founder
March 27, 2026
15 min read
Watercolor illustration of a split dashboard showing human reviewers on one side and automated scoring metrics on the other

The AI evaluation market is booming. New frameworks launch weekly. Conference talks on automated eval fill every track. Vendor demos show real-time scoring dashboards with perfect gradients and confident percentages.

Most production teams aren't using any of it.

That's the headline finding from the first large-scale survey of how agents actually work in the real world. Not how they work in papers. Not how they work in demos. How they work when a paying customer is on the other end.

The study, "Measuring AI Agents in the Real World", surveyed 306 practitioners and examined 20 detailed case studies across 26 industry domains. The numbers tell a story that doesn't match the prevailing narrative about sophisticated autonomous systems needing equally sophisticated evaluation pipelines.

74% of production agents depend primarily on human evaluation. Not as a complement to automated scoring. As the primary method.

Before we talk about why, we need to talk about what these agents actually look like in production. Because the gap between the agents people imagine and the agents people ship is where this entire conversation breaks down.

Production agents are simpler than you think

Most production agents aren't the multi-step autonomous systems that dominate conference demos and funding pitches. They're constrained, supervised, and deliberately limited. That's not a failure of ambition. It's the result of teams learning what actually works with customers.

The survey found three numbers that reframe the entire evaluation conversation:

  • 68% of agents execute ten or fewer steps before a human takes over
  • 70% rely on prompting off-the-shelf models with no fine-tuning
  • Reliability is the number one challenge, and teams address it through systems-level design rather than model improvements

That last point deserves emphasis. When something goes wrong in production, teams don't reach for a better model or a more sophisticated prompting technique. They add guardrails. They tighten constraints. They reduce the scope of what the agent can do autonomously. They build fallback paths to human operators.

This is a fundamentally different engineering posture than what most evaluation frameworks assume. The frameworks assume long-running, multi-step chains where automated scoring at each step catches accumulated errors. The reality is agents that do a handful of things, check in with a human, and move on.

What frameworks assume | What production looks like
20+ step autonomous chains | 10 or fewer steps, then human handoff
Fine-tuned domain models | Off-the-shelf models with prompting
Model-level improvements for reliability | Systems-level guardrails and fallbacks
Full automation as the goal | Human-in-the-loop as a feature
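The right-hand column of that contrast can be sketched in a few lines. This is an illustrative TypeScript sketch, not any framework's real API; `runConstrainedAgent`, `StepResult`, and the step budget are assumptions made up for this example:

```typescript
// A minimal sketch of the step-limited, human-handoff posture.
// All names here are illustrative, not from any particular SDK.
type StepResult = { done: boolean; needsHuman: boolean; output: string }

const MAX_STEPS = 10 // the survey's "ten or fewer steps" ceiling

function runConstrainedAgent(
  step: (i: number) => StepResult,
  escalate: (reason: string) => void,
): string | undefined {
  for (let i = 0; i < MAX_STEPS; i++) {
    const result = step(i)
    // Guardrail: the agent can request a human at any step
    if (result.needsHuman) {
      escalate(`agent requested handoff at step ${i}`)
      return undefined
    }
    if (result.done) return result.output
  }
  // Guardrail: never run past the step budget; fall back to a human
  escalate(`step budget of ${MAX_STEPS} exhausted`)
  return undefined
}
```

The interesting design choice is that both exits that don't resolve the task route to a human rather than retrying with a "better" prompt, which is exactly the systems-level posture the survey describes.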

This isn't a temporary state. The survey found that teams intentionally keep agents constrained because reliability matters more than capability. A customer service agent that handles three task types flawlessly is more valuable than one that attempts fifteen and fails unpredictably on three.

Princeton's research on LLM reliability echoes this. Reliability improvements lag behind accuracy gains. A model that scores 90% on benchmarks can still fail unpredictably in production conditions. Teams that understand this build around the constraint rather than fighting it.

The academic-production gap is wider than you think

If production agents are simpler than the narrative suggests, why haven't evaluation practices caught up? Part of the answer is that the research community is measuring the wrong things.

A systematic review of 84 evaluation papers, published as "Adaptive Monitoring and Real-World Evaluation of LLM Agents", found a striking imbalance:

What gets measured | Percentage of papers
Capability metrics (accuracy, task completion, throughput) | 83%
Human-centered metrics (satisfaction, trust, usability) | 30%
Economic metrics (cost per interaction, ROI, error recovery cost) | Less than 30%

83% of evaluation research measures what the model can do. Less than a third measures what matters to the humans using it.

This isn't an academic quibble. It shapes the tooling landscape. When researchers build evaluation frameworks, they optimize for the metrics they study. Those frameworks then get packaged into products. Those products get marketed to production teams. And production teams open the box and find tools that measure accuracy and task completion but have no way to score "did the customer feel heard?" or "did this interaction cost more than it was worth?"

The same review found that adaptive monitoring techniques can cut anomaly-detection latency from 12.3 seconds to 5.6 seconds and false-positive rates from 4.5% to 0.9%. The technology for real-time production monitoring works. The problem isn't technical capability. It's that the metrics being monitored don't map to what production teams actually care about.

Consider what a production team lead reviews when they listen to a customer conversation:

  1. Did the agent understand the actual problem (not just the stated one)?
  2. Did the response feel appropriate for the customer's emotional state?
  3. Did the agent know when to stop talking and when to escalate?
  4. Would this interaction make the customer more or less likely to come back?

None of these are capability metrics. They're judgment calls. And they're what 74% of production teams are paying humans to make.

Why human evaluation persists (and why that's rational)

Depending on human evaluation isn't a sign of immaturity. For most production teams today, it's the correct engineering decision.

Human reviewers catch things that automated metrics structurally cannot. Not because automated metrics are bad, but because the failure modes of constrained production agents are subtle, contextual, and domain-specific in ways that generic scoring frameworks miss.

Consider the failure modes that matter in production:

Tone misalignment. The agent gives a factually correct answer in a cheerful tone to a customer who just described a medical emergency. No accuracy metric catches this. No task completion metric flags it. A human reviewer catches it in two seconds.

Premature resolution. The agent correctly answers the stated question but misses the underlying need. "How do I reset my password?" might really mean "I've been locked out of my account for three days and I'm furious." The agent solves the password problem and closes the ticket. The customer churns. A human reviewer with domain experience recognizes the pattern.

Appropriate uncertainty. The agent doesn't know the answer and should say so. Instead, it hedges with confident-sounding language that technically doesn't commit to anything but leaves the customer thinking they got an answer. This is one of the hardest failure modes to score automatically because the response is fluent, relevant, and factually not wrong. It's just unhelpful.

These failure modes share a common property: they require understanding the interaction from the customer's perspective, not just the agent's. That's a judgment task, and for now, humans are better at it.

The CLEAR framework study, analyzing why only 10% of enterprises successfully implement generative AI in production, identifies inadequate evaluation frameworks as the main failure factor. But the inadequacy isn't "we don't have evaluation." It's "our evaluation doesn't measure what predicts customer outcomes."

Teams that rely on human evaluation are at least measuring the right things, even if they're measuring them slowly and inconsistently. Teams that switch to automated scoring too early risk measuring the wrong things with perfect consistency.

The path from human-only to hybrid

The answer isn't to stay on human evaluation forever. It's too expensive, too slow, and too inconsistent across reviewers. The answer is to build a bridge.

The transition from human-only to hybrid evaluation follows a pattern that works regardless of which tools you use:

Step 1: Capture what your reviewers actually check

Watch your best reviewers work. Not what the rubric says they should check. What they actually look at, in what order, and what makes them pause. Most experienced reviewers have internalized quality criteria that aren't written down anywhere. They'll tell you "this one felt off" and be right, but they can't always articulate why until you ask specific questions.

Document the criteria in plain language first. Not as scoring rubrics. As descriptions of what good looks like and what bad looks like, with real examples from actual conversations.

Step 2: Encode criteria as structured dimensions

Convert each human judgment into a question that can be answered about any conversation. "Did the agent acknowledge the customer's emotional state before solving the problem?" is evaluable. "Was the tone appropriate?" is vague.

This is where the work gets real. Each dimension needs:

  • A clear description of what's being measured
  • Examples of what scores high and what scores low
  • The weight it carries relative to other dimensions

Here's what that encoding looks like in practice. Say your best reviewer consistently flags interactions where the agent jumps to solutions without acknowledging frustration. That becomes a scorecard with four dimensions, each mapping directly to a human judgment call:

```typescript
import { Chanl } from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY })

// Create the scorecard
const { data: scorecard } = await chanl.scorecard.create({
  name: 'Customer Empathy Check',
  scoringAlgorithm: 'weighted_average',
  passingThreshold: 70,
})

// Add criteria, one per human judgment call
await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Emotional Acknowledgment',
  key: 'emotional-ack',
  description: 'Agent recognizes and validates the customer\'s emotional state before moving to problem-solving',
  weight: 30,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Appropriate Uncertainty',
  key: 'uncertainty',
  description: 'Agent clearly communicates when it does not have enough information rather than hedging',
  weight: 25,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Resolution Completeness',
  key: 'resolution',
  description: 'Agent addresses the underlying need, not just the surface question',
  weight: 25,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Escalation Judgment',
  key: 'escalation',
  description: 'Agent escalates at the right moment, not too early and not too late',
  weight: 20,
  type: 'prompt',
})
```

Notice what's happening. Each criterion maps directly to something a human reviewer was already checking. The structured format just makes the evaluation consistent and repeatable.

Step 3: Run both in parallel

This is the calibration phase. Score the same conversations with both human reviewers and automated criteria. Compare the scores dimension by dimension, not as a single aggregate.

You'll find three categories:

  1. High agreement dimensions where automated scoring consistently matches human judgment. These are your candidates for automation. Typical examples: did the agent introduce itself, did it confirm the customer's identity, did it summarize the resolution.

  2. Moderate agreement dimensions where automated scoring mostly agrees but misses edge cases. These need criteria refinement. Often the issue is that the written criterion is less specific than what the human reviewer actually evaluates.

  3. Low agreement dimensions where automated and human scores diverge significantly. Keep human review on these. They usually involve subjective judgment about tone, cultural context, or "feel" that current automated scoring handles poorly.
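Sorting dimensions into those three categories starts with measuring agreement per dimension, not on aggregate scores. Here is a hypothetical sketch of that comparison; the `Scores` shape and the 10-point tolerance are assumptions for illustration, not a prescribed method:

```typescript
// Hypothetical calibration helper: fraction of conversations where the
// automated score lands within `tolerance` points of the human score,
// computed separately for each quality dimension.
type Scores = Record<string, number> // dimension key -> score (0-100)

function agreementByDimension(
  human: Scores[],      // human scores, one entry per conversation
  automated: Scores[],  // automated scores for the same conversations
  tolerance = 10,       // assumed: within 10 points counts as agreement
): Record<string, number> {
  const rates: Record<string, number> = {}
  const dims = Object.keys(human[0] ?? {})
  for (const dim of dims) {
    let agree = 0
    human.forEach((h, i) => {
      if (Math.abs(h[dim] - automated[i][dim]) <= tolerance) agree++
    })
    rates[dim] = agree / human.length
  }
  return rates
}
```

A dimension with an agreement rate near 1.0 is a candidate for automation; one near 0.5 stays with human reviewers while you refine (or retire) its written criterion.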

Step 4: Shift weight gradually

As you refine criteria and agreement improves, shift review volume from human to automated. Not all at once. Dimension by dimension. The greeting and identity verification criteria might reach 95% agreement within two weeks. The empathy and uncertainty criteria might take months.

The goal isn't 100% automated. The goal is the right ratio for each quality dimension, adjusted over time as your automated criteria get better.
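One way to sketch that gradual shift, assuming agreement rates are already measured per dimension. The thresholds and review shares below are illustrative assumptions, not recommendations:

```typescript
// Hypothetical routing rule: per dimension, how much review volume
// stays with humans, given measured human/automated agreement (0-1).
function humanReviewShare(agreement: number): number {
  if (agreement >= 0.95) return 0.05 // spot checks only
  if (agreement >= 0.85) return 0.25 // automated leads, humans audit
  if (agreement >= 0.7) return 0.5   // split while criteria are refined
  return 1.0 // keep full human review until agreement improves
}
```

Applied dimension by dimension, this is how greeting checks reach near-full automation in weeks while empathy scoring keeps a human majority for months.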

What "good enough" automated evaluation looks like

The destination isn't a world where humans never review conversations. It's a world where automated scoring handles the consistent, scalable work and human review focuses on the hard parts.

Good automated evaluation in production has three properties:

It's dimensional, not aggregate. A single quality score is useless for debugging. You need scores per dimension so you know whether the problem is accuracy, tone, resolution, or escalation judgment. An agent scoring 92% overall but 45% on appropriate uncertainty tells a completely different story than 92% overall with 88% on every dimension.
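A quick worked example of why the aggregate misleads, using a hypothetical weighted roll-up (the weights and scores are invented for illustration):

```typescript
// A plain weighted average across dimensions.
function weightedAverage(dims: { score: number; weight: number }[]): number {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0)
  return dims.reduce((sum, d) => sum + d.score * d.weight, 0) / totalWeight
}

// Hypothetical agent: three strong dimensions, one failing badly.
const dims = [
  { score: 95, weight: 30 }, // emotional acknowledgment
  { score: 95, weight: 25 }, // resolution completeness
  { score: 95, weight: 25 }, // escalation judgment
  { score: 45, weight: 20 }, // appropriate uncertainty: failing
]

const overall = weightedAverage(dims) // 85: comfortably above a 70 threshold
```

The overall score of 85 sails past a 70-point passing threshold; only the per-dimension view exposes the 45 on appropriate uncertainty.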

It runs on every conversation, not a sample. Sample-based evaluation misses patterns. If 2% of conversations have a specific failure mode, a 5% sample might catch it. Or it might not. Scoring every conversation turns evaluation from a statistical exercise into a monitoring system.

It's calibrated against human judgment regularly. Automated criteria drift. The conversations your agent handles change over time. New edge cases appear. Quarterly recalibration against human review keeps automated scores honest.

Here's what the maturity curve typically looks like for a production team:

Phase | Human review | Automated scoring | Duration
Discovery | 100% of sampled conversations | None | 2-4 weeks
Encoding | 100% of sampled conversations | Running in shadow mode, not trusted | 2-4 weeks
Calibration | 50% overlap with automated | All conversations scored, compared to human | 4-8 weeks
Hybrid | Focused on low-agreement dimensions + new failure modes | Primary scoring on high-agreement dimensions | Ongoing
Mature | Spot checks + novel edge cases + quarterly recalibration | Primary scoring on most dimensions, alerting on drops | Ongoing

Notice "mature" still includes human review. It always will. The role shifts from "score everything" to "teach the system what it doesn't know yet" and "catch what we haven't built criteria for."

Monitoring as continuous evaluation

The teams doing evaluation well share one insight: evaluation isn't a testing phase you complete before deployment. It's an ongoing monitoring activity that runs alongside production.

The traditional model looks like this: build agent, run evals, pass threshold, deploy, move on. Check back in a month. Hope nothing has changed.

The production model looks like this: every conversation is an evaluation opportunity. Every interaction gets scored against quality dimensions. Scores feed dashboards. Dashboards trigger alerts. Alerts drive investigation. Investigation produces new criteria or agent improvements. Repeat.

This is where the 74% human evaluation number becomes most interesting. If you're already paying humans to review conversations, you're already doing continuous evaluation. You're just doing it manually. The question isn't "should we start evaluating?" It's "how do we make the evaluation we're already doing more systematic, more consistent, and more scalable?"

Production monitoring in this model means scoring every conversation across quality dimensions, tracking dimension scores over time, alerting when any dimension drops below threshold, and drilling into the specific conversations that triggered the alert. It's not fundamentally different from infrastructure monitoring. You're just monitoring conversation quality instead of server uptime.
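That loop can be sketched as a rolling window per dimension with a threshold alert. The class name, window size, and message format are illustrative assumptions, not a real monitoring API:

```typescript
// Hypothetical dimension monitor: tracks a rolling average of scores
// for one quality dimension and alerts when it drops below threshold.
class DimensionMonitor {
  private window: number[] = []

  constructor(
    private readonly dimension: string,
    private readonly threshold: number,
    private readonly windowSize = 50, // assumed: last 50 conversations
  ) {}

  // Record one conversation's score; returns an alert message or null.
  record(score: number): string | null {
    this.window.push(score)
    if (this.window.length > this.windowSize) this.window.shift()
    const avg = this.window.reduce((sum, s) => sum + s, 0) / this.window.length
    return avg < this.threshold
      ? `${this.dimension} rolling average ${avg.toFixed(1)} below ${this.threshold}`
      : null
  }
}
```

One monitor per scorecard dimension, fed by the score of every conversation, is the conversation-quality analogue of a latency alarm on a service dashboard.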

The adaptive monitoring research backs this up. When monitoring is continuous and dimension-aware, anomaly detection gets dramatically better. The 12.3-second-to-5.6-second latency improvement and the 4.5%-to-0.9% false-positive reduction come from systems that are always watching and tuned to specific quality signals, not from systems that run batch evaluations on a schedule.

For teams already running scenario testing before deployment, the monitoring layer catches what pre-deployment testing can't: distribution shifts in real customer conversations. The customers your agent talks to today aren't the same as the customers in your test scenarios. Their problems evolve. Their language shifts. New edge cases emerge from product changes, policy updates, and seasonal patterns. Continuous scoring catches these changes as they happen instead of in next month's review cycle.

The survey numbers in context

Let's come back to the big number: 74% of production agents depend primarily on human evaluation.

Given what the survey tells us about production agents, this makes sense:

  1. Agents are simple enough for human review to be feasible. Ten or fewer steps per interaction. Limited autonomy. Human reviewers can evaluate a constrained conversation in minutes.

  2. Automated eval tools measure the wrong things. 83% of evaluation research focuses on capability metrics. Production teams need judgment metrics. The tooling gap is real.

  3. Trust hasn't been established. Teams that haven't run automated and human scoring in parallel have no evidence that automated scoring catches what matters. Without evidence, why would you trust it?

  4. The cost of getting it wrong is high. These are customer-facing systems. A false sense of quality from automated scoring that misses critical failure modes is worse than slow human review that catches them.

The path forward isn't evangelizing automated evaluation harder. It's meeting teams where they are. They're already evaluating. They're already catching real issues. They need a bridge from manual review to systematic, dimension-aware scoring that maintains the judgment quality they trust.

That bridge is built one dimension at a time. Start with what your reviewers already check. Encode it. Calibrate it. Trust it only after it's earned trust through demonstrated agreement with human judgment.

The 74% will come down over the next two years. Not because the tooling gets flashier, but because teams will have spent enough time running human and automated evaluation side by side to know which automated scores they can trust and which ones still need a human pair of eyes.

The eval market doesn't need more tools. It needs more patience with the transition.
