
74% of Production Agents Still Rely on Human Evaluation

A survey of 306 practitioners reveals most production agents are far simpler than expected. The eval gap isn't a tooling problem. It's a trust problem.

Dean Grover, Co-founder
March 27, 2026
15 min read
Watercolor illustration of a split dashboard showing human reviewers on one side and automated scoring metrics on the other

The AI evaluation market is booming. New frameworks launch weekly. Conference talks on automated eval fill every track. Vendor demos show real-time scoring dashboards with perfect gradients and confident percentages.

Most production teams aren't using any of it.

That's the headline finding from the first large-scale survey of how agents actually work in the real world. Not how they work in papers. Not how they work in demos. How they work when a paying customer is on the other end.

The study, "Measuring AI Agents in the Real World", surveyed 306 practitioners and examined 20 detailed case studies across 26 industry domains. The numbers tell a story that doesn't match the prevailing narrative about sophisticated autonomous systems needing equally sophisticated evaluation pipelines.

74% of production agents depend primarily on human evaluation. Not as a complement to automated scoring. As the primary method.

Before we talk about why, we need to talk about what these agents actually look like in production. Because the gap between the agents people imagine and the agents people ship is where this entire conversation breaks down.

Production agents are simpler than you think

Most production agents aren't the multi-step autonomous systems that dominate conference demos and funding pitches. They're constrained, supervised, and deliberately limited. That's not a failure of ambition. It's the result of teams learning what actually works with customers.

The survey found three numbers that reframe the entire evaluation conversation:

  • 68% of agents execute ten or fewer steps before a human takes over
  • 70% rely on prompting off-the-shelf models with no fine-tuning
  • Reliability is the number one challenge, and teams address it through systems-level design rather than model improvements

That last point deserves emphasis. When something goes wrong in production, teams don't reach for a better model or a more sophisticated prompting technique. They add guardrails. They tighten constraints. They reduce the scope of what the agent can do autonomously. They build fallback paths to human operators.

This is a fundamentally different engineering posture than what most evaluation frameworks assume. The frameworks assume long-running, multi-step chains where automated scoring at each step catches accumulated errors. The reality is agents that do a handful of things, check in with a human, and move on.

What frameworks assume | What production looks like
20+ step autonomous chains | 10 or fewer steps, then human handoff
Fine-tuned domain models | Off-the-shelf models with prompting
Model-level improvements for reliability | Systems-level guardrails and fallbacks
Full automation as the goal | Human-in-the-loop as a feature
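The right-hand column of that contrast can be sketched in a few lines. This is an illustrative TypeScript sketch, not any framework's real API; `runConstrainedAgent`, `StepResult`, and the step budget are assumptions made up for this example:

```typescript
// A minimal sketch of the step-limited, human-handoff posture.
// All names here are illustrative, not from any particular SDK.
type StepResult = { done: boolean; needsHuman: boolean; output: string }

const MAX_STEPS = 10 // the survey's "ten or fewer steps" ceiling

function runConstrainedAgent(
  step: (i: number) => StepResult,
  escalate: (reason: string) => void,
): string | undefined {
  for (let i = 0; i < MAX_STEPS; i++) {
    const result = step(i)
    // Guardrail: the agent can request a human at any step
    if (result.needsHuman) {
      escalate(`agent requested handoff at step ${i}`)
      return undefined
    }
    if (result.done) return result.output
  }
  // Guardrail: never run past the step budget; fall back to a human
  escalate(`step budget of ${MAX_STEPS} exhausted`)
  return undefined
}
```

The interesting design choice is that both exits that don't resolve the task route to a human rather than retrying with a "better" prompt, which is exactly the systems-level posture the survey describes.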

This isn't a temporary state. The survey found that teams intentionally keep agents constrained because reliability matters more than capability. A customer service agent that handles three task types flawlessly is more valuable than one that attempts fifteen and fails unpredictably on three.

Princeton's research on LLM reliability echoes this. Reliability improvements lag behind accuracy gains. A model that scores 90% on benchmarks can still fail unpredictably in production conditions. Teams that understand this build around the constraint rather than fighting it.

The academic-production gap is wider than you think

If production agents are simpler than the narrative suggests, why haven't evaluation practices caught up? Part of the answer is that the research community is measuring the wrong things.

A systematic review of 84 evaluation papers, published as "Adaptive Monitoring and Real-World Evaluation of LLM Agents", found a striking imbalance:

What gets measured | Percentage of papers
Capability metrics (accuracy, task completion, throughput) | 83%
Human-centered metrics (satisfaction, trust, usability) | 30%
Economic metrics (cost per interaction, ROI, error recovery cost) | Less than 30%

83% of evaluation research measures what the model can do. Less than a third measures what matters to the humans using it.

This isn't an academic quibble. It shapes the tooling landscape. When researchers build evaluation frameworks, they optimize for the metrics they study. Those frameworks then get packaged into products. Those products get marketed to production teams. And production teams open the box and find tools that measure accuracy and task completion but have no way to score "did the customer feel heard?" or "did this interaction cost more than it was worth?"

The same review found that adaptive monitoring techniques can cut anomaly-detection latency from 12.3 seconds to 5.6 seconds and false-positive rates from 4.5% to 0.9%. The technology for real-time production monitoring works. The problem isn't technical capability. It's that the metrics being monitored don't map to what production teams actually care about.

Consider what a production team lead reviews when they listen to a customer conversation:

  1. Did the agent understand the actual problem (not just the stated one)?
  2. Did the response feel appropriate for the customer's emotional state?
  3. Did the agent know when to stop talking and when to escalate?
  4. Would this interaction make the customer more or less likely to come back?

None of these are capability metrics. They're judgment calls. And they're what 74% of production teams are paying humans to make.

Why human evaluation persists (and why that's rational)

Depending on human evaluation isn't a sign of immaturity. For most production teams today, it's the correct engineering decision.

Human reviewers catch things that automated metrics structurally cannot. Not because automated metrics are bad, but because the failure modes of constrained production agents are subtle, contextual, and domain-specific in ways that generic scoring frameworks miss.

Consider the failure modes that matter in production:

Tone misalignment. The agent gives a factually correct answer in a cheerful tone to a customer who just described a medical emergency. No accuracy metric catches this. No task completion metric flags it. A human reviewer catches it in two seconds.

Premature resolution. The agent correctly answers the stated question but misses the underlying need. "How do I reset my password?" might really mean "I've been locked out of my account for three days and I'm furious." The agent solves the password problem and closes the ticket. The customer churns. A human reviewer with domain experience recognizes the pattern.

Appropriate uncertainty. The agent doesn't know the answer and should say so. Instead, it hedges with confident-sounding language that technically doesn't commit to anything but leaves the customer thinking they got an answer. This is one of the hardest failure modes to score automatically because the response is fluent, relevant, and factually not wrong. It's just unhelpful.

These failure modes share a common property: they require understanding the interaction from the customer's perspective, not just the agent's. That's a judgment task, and for now, humans are better at it.

The CLEAR framework study, analyzing why only 10% of enterprises successfully implement generative AI in production, identifies inadequate evaluation frameworks as the main failure factor. But the inadequacy isn't "we don't have evaluation." It's "our evaluation doesn't measure what predicts customer outcomes."

Teams that rely on human evaluation are at least measuring the right things, even if they're measuring them slowly and inconsistently. Teams that switch to automated scoring too early risk measuring the wrong things with perfect consistency.

The path from human-only to hybrid

The answer isn't to stay on human evaluation forever. It's too expensive, too slow, and too inconsistent across reviewers. The answer is to build a bridge.

The transition from human-only to hybrid evaluation follows a pattern that works regardless of which tools you use:

Step 1: Capture what your reviewers actually check

Watch your best reviewers work. Not what the rubric says they should check. What they actually look at, in what order, and what makes them pause. Most experienced reviewers have internalized quality criteria that aren't written down anywhere. They'll tell you "this one felt off" and be right, but they can't always articulate why until you ask specific questions.

Document the criteria in plain language first. Not as scoring rubrics. As descriptions of what good looks like and what bad looks like, with real examples from actual conversations.

Step 2: Encode criteria as structured dimensions

Convert each human judgment into a question that can be answered about any conversation. "Did the agent acknowledge the customer's emotional state before solving the problem?" is evaluable. "Was the tone appropriate?" is vague.

This is where the work gets real. Each dimension needs:

  • A clear description of what's being measured
  • Examples of what scores high and what scores low
  • The weight it carries relative to other dimensions

Here's what that encoding looks like in practice. Say your best reviewer consistently flags interactions where the agent jumps to solutions without acknowledging frustration. That becomes a scorecard with four dimensions, each mapping directly to a human judgment call:

```typescript
import { Chanl } from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY })

// Create the scorecard
const { data: scorecard } = await chanl.scorecard.create({
  name: 'Customer Empathy Check',
  scoringAlgorithm: 'weighted_average',
  passingThreshold: 70,
})

// Add criteria, one per human judgment call
await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Emotional Acknowledgment',
  key: 'emotional-ack',
  description: 'Agent recognizes and validates the customer\'s emotional state before moving to problem-solving',
  weight: 30,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Appropriate Uncertainty',
  key: 'uncertainty',
  description: 'Agent clearly communicates when it does not have enough information rather than hedging',
  weight: 25,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Resolution Completeness',
  key: 'resolution',
  description: 'Agent addresses the underlying need, not just the surface question',
  weight: 25,
  type: 'prompt',
})

await chanl.scorecard.createCriterion(scorecard.id, {
  name: 'Escalation Judgment',
  key: 'escalation',
  description: 'Agent escalates at the right moment, not too early and not too late',
  weight: 20,
  type: 'prompt',
})
```

Notice what's happening. Each criterion maps directly to something a human reviewer was already checking. The structured format just makes the evaluation consistent and repeatable.

Step 3: Run both in parallel

This is the calibration phase. Score the same conversations with both human reviewers and automated criteria. Compare the scores dimension by dimension, not as a single aggregate.

You'll find three categories:

  1. High agreement dimensions where automated scoring consistently matches human judgment. These are your candidates for automation. Typical examples: did the agent introduce itself, did it confirm the customer's identity, did it summarize the resolution.

  2. Moderate agreement dimensions where automated scoring mostly agrees but misses edge cases. These need criteria refinement. Often the issue is that the written criterion is less specific than what the human reviewer actually evaluates.

  3. Low agreement dimensions where automated and human scores diverge significantly. Keep human review on these. They usually involve subjective judgment about tone, cultural context, or "feel" that current automated scoring handles poorly.
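Sorting dimensions into those three categories starts with measuring agreement per dimension, not on aggregate scores. Here is a hypothetical sketch of that comparison; the `Scores` shape and the 10-point tolerance are assumptions for illustration, not a prescribed method:

```typescript
// Hypothetical calibration helper: fraction of conversations where the
// automated score lands within `tolerance` points of the human score,
// computed separately for each quality dimension.
type Scores = Record<string, number> // dimension key -> score (0-100)

function agreementByDimension(
  human: Scores[],      // human scores, one entry per conversation
  automated: Scores[],  // automated scores for the same conversations
  tolerance = 10,       // assumed: within 10 points counts as agreement
): Record<string, number> {
  const rates: Record<string, number> = {}
  const dims = Object.keys(human[0] ?? {})
  for (const dim of dims) {
    let agree = 0
    human.forEach((h, i) => {
      if (Math.abs(h[dim] - automated[i][dim]) <= tolerance) agree++
    })
    rates[dim] = agree / human.length
  }
  return rates
}
```

A dimension with an agreement rate near 1.0 is a candidate for automation; one near 0.5 stays with human reviewers while you refine (or retire) its written criterion.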

Step 4: Shift weight gradually

As you refine criteria and agreement improves, shift review volume from human to automated. Not all at once. Dimension by dimension. The greeting and identity verification criteria might reach 95% agreement within two weeks. The empathy and uncertainty criteria might take months.

The goal isn't 100% automated. The goal is the right ratio for each quality dimension, adjusted over time as your automated criteria get better.
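One way to sketch that gradual shift, assuming agreement rates are already measured per dimension. The thresholds and review shares below are illustrative assumptions, not recommendations:

```typescript
// Hypothetical routing rule: per dimension, how much review volume
// stays with humans, given measured human/automated agreement (0-1).
function humanReviewShare(agreement: number): number {
  if (agreement >= 0.95) return 0.05 // spot checks only
  if (agreement >= 0.85) return 0.25 // automated leads, humans audit
  if (agreement >= 0.7) return 0.5   // split while criteria are refined
  return 1.0 // keep full human review until agreement improves
}
```

Applied dimension by dimension, this is how greeting checks reach near-full automation in weeks while empathy scoring keeps a human majority for months.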

What "good enough" automated evaluation looks like

The destination isn't a world where humans never review conversations. It's a world where automated scoring handles the consistent, scalable work and human review focuses on the hard parts.

Good automated evaluation in production has three properties:

It's dimensional, not aggregate. A single quality score is useless for debugging. You need scores per dimension so you know whether the problem is accuracy, tone, resolution, or escalation judgment. An agent scoring 92% overall but 45% on appropriate uncertainty tells a completely different story than 92% overall with 88% on every dimension.
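A quick worked example of why the aggregate misleads, using a hypothetical weighted roll-up (the weights and scores are invented for illustration):

```typescript
// A plain weighted average across dimensions.
function weightedAverage(dims: { score: number; weight: number }[]): number {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0)
  return dims.reduce((sum, d) => sum + d.score * d.weight, 0) / totalWeight
}

// Hypothetical agent: three strong dimensions, one failing badly.
const dims = [
  { score: 95, weight: 30 }, // emotional acknowledgment
  { score: 95, weight: 25 }, // resolution completeness
  { score: 95, weight: 25 }, // escalation judgment
  { score: 45, weight: 20 }, // appropriate uncertainty: failing
]

const overall = weightedAverage(dims) // 85: comfortably above a 70 threshold
```

The overall score of 85 sails past a 70-point passing threshold; only the per-dimension view exposes the 45 on appropriate uncertainty.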

It runs on every conversation, not a sample. Sample-based evaluation misses patterns. If 2% of conversations have a specific failure mode, a 5% sample might catch it. Or it might not. Scoring every conversation turns evaluation from a statistical exercise into a monitoring system.

It's calibrated against human judgment regularly. Automated criteria drift. The conversations your agent handles change over time. New edge cases appear. Quarterly recalibration against human review keeps automated scores honest.

Here's what the maturity curve typically looks like for a production team:

Phase | Human review | Automated scoring | Duration
Discovery | 100% of sampled conversations | None | 2-4 weeks
Encoding | 100% of sampled conversations | Running in shadow mode, not trusted | 2-4 weeks
Calibration | 50% overlap with automated | All conversations scored, compared to human | 4-8 weeks
Hybrid | Focused on low-agreement dimensions + new failure modes | Primary scoring on high-agreement dimensions | Ongoing
Mature | Spot checks + novel edge cases + quarterly recalibration | Primary scoring on most dimensions, alerting on drops | Ongoing

Notice "mature" still includes human review. It always will. The role shifts from "score everything" to "teach the system what it doesn't know yet" and "catch what we haven't built criteria for."

Monitoring as continuous evaluation

The teams doing evaluation well share one insight: evaluation isn't a testing phase you complete before deployment. It's an ongoing monitoring activity that runs alongside production.

The traditional model looks like this: build agent, run evals, pass threshold, deploy, move on. Check back in a month. Hope nothing has changed.

The production model looks like this: every conversation is an evaluation opportunity. Every interaction gets scored against quality dimensions. Scores feed dashboards. Dashboards trigger alerts. Alerts drive investigation. Investigation produces new criteria or agent improvements. Repeat.

This is where the 74% human evaluation number becomes most interesting. If you're already paying humans to review conversations, you're already doing continuous evaluation. You're just doing it manually. The question isn't "should we start evaluating?" It's "how do we make the evaluation we're already doing more systematic, more consistent, and more scalable?"

Production monitoring in this model means scoring every conversation across quality dimensions, tracking dimension scores over time, alerting when any dimension drops below threshold, and drilling into the specific conversations that triggered the alert. It's not fundamentally different from infrastructure monitoring. You're just monitoring conversation quality instead of server uptime.
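That loop can be sketched as a rolling window per dimension with a threshold alert. The class name, window size, and message format are illustrative assumptions, not a real monitoring API:

```typescript
// Hypothetical dimension monitor: tracks a rolling average of scores
// for one quality dimension and alerts when it drops below threshold.
class DimensionMonitor {
  private window: number[] = []

  constructor(
    private readonly dimension: string,
    private readonly threshold: number,
    private readonly windowSize = 50, // assumed: last 50 conversations
  ) {}

  // Record one conversation's score; returns an alert message or null.
  record(score: number): string | null {
    this.window.push(score)
    if (this.window.length > this.windowSize) this.window.shift()
    const avg = this.window.reduce((sum, s) => sum + s, 0) / this.window.length
    return avg < this.threshold
      ? `${this.dimension} rolling average ${avg.toFixed(1)} below ${this.threshold}`
      : null
  }
}
```

One monitor per scorecard dimension, fed by the score of every conversation, is the conversation-quality analogue of a latency alarm on a service dashboard.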

The adaptive monitoring research backs this up. When monitoring is continuous and dimension-aware, anomaly detection gets dramatically better. The 12.3-second-to-5.6-second latency improvement and the 4.5%-to-0.9% false-positive reduction come from systems that are always watching and tuned to specific quality signals, not from systems that run batch evaluations on a schedule.

For teams already running scenario testing before deployment, the monitoring layer catches what pre-deployment testing can't: distribution shifts in real customer conversations. The customers your agent talks to today aren't the same as the customers in your test scenarios. Their problems evolve. Their language shifts. New edge cases emerge from product changes, policy updates, and seasonal patterns. Continuous scoring catches these changes as they happen instead of in next month's review cycle.

The survey numbers in context

Let's come back to the big number: 74% of production agents depend primarily on human evaluation.

Given what the survey tells us about production agents, this makes sense:

  1. Agents are simple enough for human review to be feasible. Ten or fewer steps per interaction. Limited autonomy. Human reviewers can evaluate a constrained conversation in minutes.

  2. Automated eval tools measure the wrong things. 83% of evaluation research focuses on capability metrics. Production teams need judgment metrics. The tooling gap is real.

  3. Trust hasn't been established. Teams that haven't run automated and human scoring in parallel have no evidence that automated scoring catches what matters. Without evidence, why would you trust it?

  4. The cost of getting it wrong is high. These are customer-facing systems. A false sense of quality from automated scoring that misses critical failure modes is worse than slow human review that catches them.

The path forward isn't evangelizing automated evaluation harder. It's meeting teams where they are. They're already evaluating. They're already catching real issues. They need a bridge from manual review to systematic, dimension-aware scoring that maintains the judgment quality they trust.

That bridge is built one dimension at a time. Start with what your reviewers already check. Encode it. Calibrate it. Trust it only after it's earned trust through demonstrated agreement with human judgment.

The 74% will come down over the next two years. Not because the tooling gets flashier, but because teams will have spent enough time running human and automated evaluation side by side to know which automated scores they can trust and which ones still need a human pair of eyes.

The eval market doesn't need more tools. It needs more patience with the transition.
