It's 2:47 AM on a Tuesday when your on-call engineer gets paged. An AI agent serving thousands of customer conversations has started responding in ways that are technically coherent but completely wrong — giving users confident-sounding misinformation about return policies, pricing tiers, and account details. By the time someone notices, it's been happening for four hours. Sixty-three conversations, documented. The fix? A one-line prompt adjustment. The cost? Days of trust recovery work and a handful of refund requests you'll be fielding through the weekend.
Here's what stings most: the signal was there. Response latency had spiked slightly at 11 PM. Token count variance started climbing. One downstream evaluation score ticked down by 0.08 points. Nobody was watching those signals in combination — or if they were, the alerts were buried in a dashboard with forty-seven other notifications from that same night.
This is the real monitoring problem in 2026. It's not that teams lack data. It's that they're drowning in it.
The dashboard trap most teams fall into
More data doesn't automatically mean better visibility. This sounds obvious, but the behavior patterns in enterprise AI deployments suggest otherwise. Teams add monitoring tools, create dashboards, set up alerts — and then find that their on-call rotation is desensitized because something is always alerting. Traditional APM tools track uptime and error rates well, but they fundamentally can't answer questions about agent behavior. Was that response good? Did the agent accomplish the user's actual goal? Did it stay within your compliance guardrails?
AI agents operate non-deterministically. The same input can produce different outputs on different runs — and both might be fine, or one might be subtly wrong in a way that matters to your users. Standard infrastructure monitoring wasn't built for this. Slapping Datadog dashboards onto an LLM pipeline gives you server health but leaves the actual quality of agent behavior completely invisible.
The teams that get this right draw a hard distinction between infrastructure monitoring and behavioral monitoring. Infrastructure tells you the car is running. Behavioral monitoring tells you if it's driving in the right direction.
“Standard APM tools track latency and error rates but cannot answer critical questions about agent behavior. You need detailed trace visibility at each step of the reasoning chain.”
The two layers every production agent needs
Think of agent monitoring as two distinct stacks running in parallel.
System health covers the infrastructure: availability, response latency, dependency uptime, token usage, API error rates, and queue depths. These are table stakes — if you don't have them, you'll get surprised by outages. But they're also the easy part. Any mature cloud monitoring setup handles this.
Agent behavior covers what the agent is actually doing: response quality, task completion rates, hallucination signals, prompt effectiveness, intent coverage, sentiment shifts, and escalation patterns. This is where most teams have gaps, and it's where failures actually originate.
The mistake is treating system health as the entire monitoring picture. An agent can be technically available — sub-200ms latency, zero HTTP errors, all APIs green — while simultaneously giving users wrong information on every third turn. Your infrastructure dashboard won't catch that. Your users will.

The metrics that actually predict failures
Not all metrics are created equal. Some tell you what already happened. A few tell you what's about to happen.
Response latency — but watch the distribution, not the average
Average latency is nearly useless for prediction. What matters is the 95th and 99th percentile, and more specifically, sudden changes to those numbers over short time windows. A gradual increase in p99 latency often precedes model degradation or upstream service instability. A spike in p99 with stable averages usually signals specific query types hitting edge cases.
For voice AI agents in particular, latency isn't just a performance metric — it's a quality metric. When response generation takes longer than expected, the acoustic experience degrades in ways that push users to hang up or repeat themselves. Research on voice interaction quality consistently points to a threshold around 350-400ms: above that range, conversational rhythm breaks down and abandonment rates climb.
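As a rough sketch of percentile-based tracking — the window shapes, the 1.5× ratio, and the function names here are illustrative, not from any particular stack — flag on the p99 of a recent window against a baseline window rather than on averages:

```python
import numpy as np

VOICE_THRESHOLD_MS = 400  # illustrative, from the 350-400 ms range discussed above


def latency_percentiles(samples_ms):
    """Return (p50, p95, p99) for a window of latency samples."""
    arr = np.asarray(samples_ms, dtype=float)
    return tuple(np.percentile(arr, [50, 95, 99]))


def p99_alert(current_window, baseline_window, ratio=1.5):
    """Flag when the current window's p99 exceeds the baseline p99 by `ratio`,
    or crosses the voice-quality threshold -- even if averages look stable."""
    _, _, p99_now = latency_percentiles(current_window)
    _, _, p99_base = latency_percentiles(baseline_window)
    return p99_now > ratio * p99_base or p99_now > VOICE_THRESHOLD_MS
```

Note that a p99 spike with a stable mean — a handful of 900 ms outliers in a sea of 100 ms responses — fires here, which is exactly the edge-case signal averages hide.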
Token usage variance
Sudden changes in how many tokens your agent consumes per conversation are a surprisingly reliable early signal. When an agent starts generating unusually long responses, it often means it's uncertain — hedging, repeating itself, padding. When token counts drop sharply, the agent may have started truncating reasoning or skipping steps it normally performs.
Neither direction is automatically bad, but both are worth investigating. If your average conversation was 800 tokens yesterday and 1,200 tokens today and you didn't change anything, something changed anyway.
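A minimal version of this check, assuming you log a token count per conversation, is a z-score on the recent window against a baseline window. The window sizes and the 3-sigma threshold are illustrative:

```python
from statistics import mean, stdev


def token_anomaly(recent_counts, baseline_counts, z_threshold=3.0):
    """Flag when mean tokens-per-conversation in the recent window sits more
    than `z_threshold` standard deviations from the baseline mean. Fires in
    both directions: padding/hedging (up) and truncated reasoning (down)."""
    mu, sigma = mean(baseline_counts), stdev(baseline_counts)
    if sigma == 0:
        return mean(recent_counts) != mu
    z = (mean(recent_counts) - mu) / sigma
    return abs(z) > z_threshold
```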
Quality scores in production
This is the metric that separates mature monitoring setups from the rest. If you're running evaluation scorecards during testing, you should be running them — or versions of them — in production, too. Not on every conversation (cost and latency make that impractical), but on a statistically meaningful sample. Automated scorecard evaluation can run continuously on sampled live traffic, giving you a rolling quality baseline without reviewing every conversation by hand.
When that baseline starts drifting — even slightly, even gradually — that's a warning sign. Most catastrophic quality failures in production don't arrive suddenly. They drift. Response scores decline 0.05 points over three days while engineering is focused on other things. Then something — a model update, a new query pattern, an edge case in the prompt — tips the accumulated drift into something users notice.
Watch the trend, not just the number.
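One way to watch the trend rather than the number — a hypothetical sketch over daily sampled scores, with an illustrative window size and drift threshold:

```python
from statistics import mean


def quality_drift(daily_scores, window=3, drift_threshold=0.05):
    """Compare the mean of the most recent `window` days of sampled scores
    against the mean of the days before them. Returns (drift, alarmed).
    Catches slow decline that a static threshold on the raw score misses."""
    if len(daily_scores) <= window:
        return 0.0, False
    baseline = mean(daily_scores[:-window])
    recent = mean(daily_scores[-window:])
    drift = recent - baseline
    return drift, drift <= -drift_threshold
```

A sequence like 0.82, 0.81, 0.82, 0.80, 0.78, 0.76, 0.74 never crosses an "alarming" absolute value, but the trailing three-day mean has drifted more than 0.05 below its own baseline — which is the shape most production quality failures take.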
Escalation and fallback rates
How often is your agent handing off to a human? How often is it saying "I don't know" or triggering a fallback flow? These rates carry signal in both directions. An escalation rate that's too low might mean the agent is confidently handling things it shouldn't be — a red flag for hallucination or out-of-scope behavior. An escalation rate that's rising without a corresponding increase in traffic volume usually means something about query distribution has shifted.
For contact center deployments, a 10-15% week-over-week increase in escalations — without a corresponding increase in total interaction volume — is worth a same-day investigation. It won't always be a crisis, but it's often the first visible symptom of one.
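The week-over-week check can be sketched as follows; the 10% relative-jump threshold mirrors the illustrative range above, and the function name is hypothetical:

```python
def escalation_check(escalations_now, escalations_prev,
                     volume_now, volume_prev, rate_jump=0.10):
    """Flag a week-over-week escalation-RATE increase of `rate_jump` or more.
    Normalizing by volume means a traffic surge alone does not fire this."""
    rate_now = escalations_now / volume_now
    rate_prev = escalations_prev / volume_prev
    relative_increase = (rate_now - rate_prev) / rate_prev
    return relative_increase >= rate_jump
```

The same 15 extra escalations fire the check at flat traffic but stay silent if volume rose proportionally, which is the distinction the paragraph above is drawing.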
Semantic drift on responses
Most teams haven't touched this yet, but it's worth understanding where it's headed. Advanced drift detection can analyze the semantic patterns of responses over time, catching subtle shifts in what the agent says even when surface-level metrics look fine. If your return policy changed last Thursday and the agent's responses started shifting then — even if nobody updated the prompt — that's something you want to catch.
This kind of monitoring requires embedding-level analysis and won't come from basic APM tooling. But for agents in high-stakes domains (finance, healthcare, compliance-sensitive customer service), it's worth the investment. You can surface these patterns through conversation analytics that track topic distribution and response clustering over time.
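A minimal sketch of the embedding-level comparison, assuming you already have response embeddings from whatever model you use; the 0.15 cosine-distance threshold is illustrative, not a standard:

```python
import numpy as np


def centroid(embeddings):
    """Mean vector of a set of response embeddings, L2-normalized."""
    c = np.asarray(embeddings, dtype=float).mean(axis=0)
    return c / np.linalg.norm(c)


def semantic_drift(baseline_embeddings, recent_embeddings, max_cosine_dist=0.15):
    """Compare the centroid of recent response embeddings against a baseline
    centroid. Returns (distance, flagged). A large cosine distance suggests
    responses have shifted semantically even if surface metrics look fine."""
    dist = 1.0 - float(np.dot(centroid(baseline_embeddings),
                              centroid(recent_embeddings)))
    return dist, dist > max_cosine_dist
```

Production versions cluster responses per topic rather than using one global centroid, but the core signal — drift of the response distribution away from its own history — is the same.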
Alert fatigue: why your team stops caring
Engineering teams running AI in production consistently report the same pattern: alerts fire constantly, and most go uninvestigated. The issue isn't volume alone — it's signal quality. When every deviation triggers a notification, on-call engineers learn to ignore the noise, and real problems get buried alongside the false positives. That's not a statement about lazy engineers — it's what happens when alerting strategy isn't designed for the noisiness of AI systems.
The noisiness is structural. Latency for an LLM call varies more than latency for a database read. Quality scores fluctuate even when the underlying model is fine. Set alert thresholds too tight, and you're paging on noise constantly. Too loose, and you'll miss real problems.
A few principles that help cut through this:
Alert on rate of change, not absolute values. A quality score of 0.73 isn't inherently alarming — you need context about what it was yesterday and last week. A drop of 0.12 points in six hours is alarming regardless of the starting value. Alerting systems that track derivatives (rates of change) over time windows catch real problems faster and produce fewer false positives than static threshold alerts.
Group correlated signals. When latency, token variance, and quality scores all move in the same direction in the same time window, that's almost certainly a real issue. When only one metric twitches, it's probably noise. Modern observability platforms can correlate these signals automatically, so on-call engineers see "three metrics degraded simultaneously" rather than three separate alerts that look unrelated.
Maintain separate alert tiers. Separate "page someone right now" from "review this in the morning" from "this is interesting, log it." Not everything that deviates from baseline is a production incident. Your SLA and your customer impact should drive tiering, not raw metric thresholds.
Define your baselines per time window. Agent behavior differs between 9 AM on Monday and 2 AM on Sunday. Alerts calibrated against a single global baseline will fire constantly on weekend nights and miss problems during peak hours. Window-aware baselines — separate weekday/weekend, separate business-hours/off-hours — cut false positive rates significantly.
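The grouping and tiering principles above can be sketched together. The metric names, the 10% degradation threshold, and the two-signal page rule are all illustrative:

```python
def correlated_alert(deltas, threshold=0.10, degrade_count=2):
    """`deltas` maps metric name -> relative change in that metric's *bad*
    direction over the window (latency up, quality down, both positive here).
    Page only when several metrics degrade together; a single twitching
    metric is queued for morning review rather than paged."""
    degraded = sorted(name for name, d in deltas.items() if d > threshold)
    if len(degraded) >= degrade_count:
        return "page", degraded
    if degraded:
        return "review", degraded
    return "ok", degraded
```

The on-call engineer then sees one "two metrics degraded simultaneously" notification instead of two seemingly unrelated alerts.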
The scenario you need to test before it happens
A monitoring system you haven't validated gives you a false sense of security. It feels reassuring right up until the moment it fails to catch something that matters.
Before you feel confident in your alerting, you should be able to answer these questions from memory: What is your current baseline quality score? What was it two weeks ago? If a prompt change silently degraded response quality by 15%, how long before your alerting would catch it? If your agent started refusing to handle a common query type that it handled fine yesterday, what would be the first signal?
If you can't answer at least two of those, your monitoring coverage has gaps.
The fix is scenario-based testing against your production monitoring stack — deliberately introducing known degradations and verifying that your alerts fire. This isn't the same as load testing or chaos engineering (though those matter too). It's specifically about validating that your quality signals behave the way you expect when something goes wrong.
Run a version of your agent with a deliberately bad prompt on 5% of traffic. Does your quality monitoring catch it? In how long? This kind of practice — treating your monitoring as something to test rather than something to set and forget — catches blind spots before customers do.
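Here is a hypothetical harness for that kind of scenario test: simulate traffic with a degraded variant mixed in at a known rate, and measure how many conversations pass before a detector fires. Every number and name below is illustrative, and the toy detector stands in for your real alerting logic.

```python
import random


def simulate_detection(detector, baseline_mean=0.82, degraded_mean=0.70,
                       sample_rate=0.05, budget=500, seed=7):
    """Route `sample_rate` of simulated traffic through a deliberately
    degraded variant; return how many conversations passed before
    `detector` fired, or None if it never fired within the budget
    (i.e. a coverage gap in your monitoring)."""
    rng = random.Random(seed)
    scores = []
    for i in range(1, budget + 1):
        degraded = rng.random() < sample_rate
        mu = degraded_mean if degraded else baseline_mean
        scores.append(rng.gauss(mu, 0.03))
        if detector(scores):
            return i
    return None


def toy_detector(scores, window=50, floor=0.78):
    """Stand-in detector: fire when the rolling mean of the last `window`
    sampled scores drops below `floor`. Swap in your real alerting here."""
    recent = scores[-window:]
    return len(recent) == window and sum(recent) / window < floor
```

Running this with a small canary rate against a threshold-based detector often shows the alert never firing at all — which is precisely the blind spot this exercise exists to expose before a customer does.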
- Define baseline quality scores with separate windows for peak and off-peak traffic
- Set alerts on rate-of-change for quality metrics, not just absolute thresholds
- Monitor token usage variance as an early signal for model uncertainty
- Track escalation and fallback rates with week-over-week trend tracking
- Run correlated alert grouping so related signals surface together
- Test your monitoring by introducing known degradations and verifying alert response
- Separate infrastructure health alerts from behavioral quality alerts into distinct channels
- Review sampled production conversations weekly to catch semantic drift before metrics do
- Establish a p95 and p99 latency baseline and alert on sustained deviations
- Validate scorecard coverage — are you sampling enough conversations to be statistically meaningful?
When to actually panic (and when not to)
Not all metric deviations deserve the same response. Here's a rough guide based on patterns in how production AI incidents actually unfold.
Don't panic yet (investigate within 24 hours):
- Quality scores decline 5-8% from your 7-day baseline
- Token usage increases 20-30% without a corresponding traffic change
- Escalation rate rises 10-12% week-over-week
- A single downstream API is showing elevated error rates
Escalate now (same-day investigation required):
- Quality scores drop more than 10% in a 6-hour window
- Semantic drift analysis flags unusual clustering in responses
- Escalation rate spikes 25%+ with no corresponding traffic increase
- Token costs spike unexpectedly (potential prompt injection or runaway tool call loops)
- p99 latency exceeds your defined voice quality threshold for more than 15 minutes
Wake someone up:
- Agent is giving factually wrong information in a compliance-sensitive domain and it's been confirmed through spot-checking
- Quality scores have been declining steadily for 48+ hours and no one has acted on it
- A critical tool call (payment processing, appointment booking, account modification) is failing silently
- The agent is looping — triggering recursive tool calls or stuck in a reasoning cycle
The hardest category is the second one — the "escalate now" tier. These don't always feel urgent in the moment, especially at 2 AM when you're groggy and the metrics are "only a little weird." But they're the situations that turn into the 4 AM wake-up calls if you wait.
Establish protocols ahead of time for who responds to what tier. The right responder for a 2 AM server-down alert and the right responder for a "quality scores have been declining for 48+ hours" alert are probably different people with different skill sets.
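The tiering above can be encoded as a simple triage function. The thresholds mirror this section's illustrative numbers and should be tuned to your own SLA:

```python
def triage(quality_drop_6h, escalation_spike,
           tool_call_failing_silently=False, wrong_info_confirmed=False):
    """Map observed signals to the response tiers above.
    quality_drop_6h and escalation_spike are relative changes (0.10 = 10%)."""
    if wrong_info_confirmed or tool_call_failing_silently:
        return "wake-someone-up"
    if quality_drop_6h > 0.10 or escalation_spike >= 0.25:
        return "same-day"
    if quality_drop_6h >= 0.05 or escalation_spike >= 0.10:
        return "within-24h"
    return "monitor"
```

Agreeing on the function's thresholds in advance is the point: at 2 AM, nobody should be deciding from scratch whether "only a little weird" is worth acting on.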
Building toward predictive monitoring
The current state of the art in AI agent monitoring is reactive — you catch things after they start going wrong, just (hopefully) before customers do. The direction the field is moving is toward predictive monitoring: systems that catch the precursors to degradation before the degradation itself appears.
What does that look like in practice? It means training anomaly detection models on your historical quality data, so the system learns what "normal drift" looks like versus "drift that precedes a failure mode." It means tracking leading indicators — input query distribution shifts, upstream model version changes, seasonal traffic pattern changes — alongside the output metrics that directly measure quality.
It also means integrating your pre-production test coverage with your production monitoring. When you run A/B comparisons between prompt versions before deploying, the scoring data from those tests becomes a baseline expectation for what the deployed version should look like in production. If the deployed version starts diverging from what your pre-production testing predicted, that divergence itself is a signal.
This integration — between the testing lifecycle and the monitoring lifecycle — is where the biggest leverage is. Right now, for most teams, testing happens before deployment and monitoring happens after. They use different tools, different metrics, different mental models. Closing that gap is the key to catching problems earlier, cheaper, and with less noise.
The practical starting point
If you're reading this because you're standing up monitoring for an agent that's about to go into production, here's where to start. You don't need to build everything at once.
Week one: get your infrastructure health metrics in place. Latency, error rates, availability, dependency uptime. Wire up alerts with a 24-hour delay for non-critical deviations. This is fast and prevents the most obvious failures.
Week two: set up automated quality sampling on 10-15% of live conversations. Use the same scoring criteria you used during testing. Calculate a rolling 7-day baseline and alert when it drops more than 8% in a 24-hour window.
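For the sampling itself, hash-based selection is one simple approach (a sketch — the rate and the conversation-ID scheme are yours to choose):

```python
import hashlib


def sample_for_scoring(conversation_id, rate=0.12):
    """Deterministically select ~`rate` of conversations for automated
    scoring. Hashing the ID keeps the decision stable across retries and
    replays, unlike drawing a fresh random number per request."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Determinism matters here: when you replay a flagged conversation during an investigation, it lands in the same sample bucket it did in production.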
Week three: add escalation and fallback rate tracking. Wire this into your baseline calculation from week two.
Month two onward: add token variance tracking, semantic drift analysis, and start correlating signals rather than alerting on them individually.
The teams that get ahead of this aren't the ones with the most sophisticated monitoring stacks on day one. They're the ones that start early, establish baselines, and iterate continuously.
Monitor what actually matters — in production
Chanl's real-time monitoring connects your pre-production test coverage to live agent behavior, so you can catch quality drift before customers notice. Continuous sampling, automated scoring, and behavioral anomaly detection built for AI agents.
What 2:47 AM could have looked like instead
Back to that Tuesday night incident. The same agent, the same underlying failure mode — but this time, the team had correlated alerts across quality scores, token variance, and escalation rates. At 11:14 PM, when the first signal appeared, an automated notification went to the on-call channel: "Quality score trending down 0.09 points from 6-hour baseline; token variance +28%; investigate before morning."
The on-call engineer woke up to a Slack message — not an urgent page, because no single metric crossed a critical threshold, but a correlated signal worth a look. She spotted the issue, rolled back to the previous prompt version, and was back asleep by midnight. The problem affected eleven conversations instead of sixty-three. No customer-facing impact significant enough for anyone to notice.
That gap — eleven conversations versus sixty-three, midnight versus 4 AM — is entirely about what you're watching and how you respond to what you see. It doesn't require exotic technology. It requires deliberate monitoring design, tested baselines, and a team that has agreed in advance what each signal means.
The metrics are there. The signals are almost always there. The question is whether you've built a system that surfaces them before customers become the ones who notice.
Chanl Team
AI Agent Testing Platform
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.