Chanl
Knowledge & Memory

Your AI Agent Isn't Learning From Production. Here's What That's Costing You.

Most AI agents are deployed and forgotten. The teams winning with AI have a different strategy: closing the loop from every live call back into the agent itself.

Lucas Dalamarta, Engineering Lead
March 18, 2026
14 min read

You shipped the agent. It handles calls. Your dashboard shows it's running. And you feel good, because it works.

But here's the question nobody asks often enough: is it getting better?

For most AI agents in production, the answer is no. They're static. They handle what they handled on day one. They fail in the same places they failed on day one. And the mountain of evidence sitting in your call logs, evidence that would tell you exactly what to fix, goes unread.

The teams beating you with AI aren't necessarily using a different model or a better orchestration platform. They've closed a loop you haven't: every call feeds back into the agent.

That's the data flywheel. And it's the difference between an AI deployment and an AI system that actually compounds.

What a Data Flywheel Actually Is

A data flywheel is a feedback loop where live production interactions generate improvement signal, that signal gets processed into agent changes, and those changes generate better interactions. Repeat. The key word is loop, not pipeline, not dashboard, not batch process.

Think about what happens in a single call: a customer explains their problem, the agent interprets it, decides what to do, does it (or fails), and the call ends with some outcome. That's a complete data point. You know what was said, what the agent tried, and whether it worked. That's enormously valuable.

Now multiply that by 500 calls a day. You have 500 labeled examples of your agent performing: succeeding in some cases, failing in others, handling things you never anticipated when you wrote the prompts. And if you're not using that data, you're not just missing an optimization opportunity. You're flying blind.

The flywheel closes that loop. The exact mechanism depends on your stack and team, but the logic is always the same:

  1. Capture: every call generates structured data (transcript, outcome, latency, escalation flag, satisfaction signal)
  2. Analyze: surface the patterns (what's failing, where, for which intents)
  3. Improve: change the agent (better prompts, fixed tool configs, updated knowledge, new memory patterns)
  4. Deploy: push the changes and monitor whether the failure rate drops
  5. Repeat: the improved agent generates better data, which surfaces the next layer of failures
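One turn of this loop can be sketched in a few lines of code. The record fields, the fix format, and the function bodies below are illustrative assumptions, not any particular platform's schema:

```python
from collections import Counter

def analyze(calls):
    """Surface the most frequent failing intents from structured call records."""
    failures = [c["intent"] for c in calls if not c["resolved"]]
    return Counter(failures).most_common(3)

def improve(agent_config, failure_patterns):
    """Record a targeted fix for each surfaced pattern (e.g. a prompt tweak)."""
    for intent, count in failure_patterns:
        agent_config["fixes"].append(f"prompt update for intent: {intent}")
    return agent_config

# One cycle: captured calls in, agent changes out. Deploy + monitor would follow.
calls = [
    {"intent": "billing", "resolved": False},
    {"intent": "billing", "resolved": False},
    {"intent": "address_change", "resolved": True},
]
config = {"fixes": []}
patterns = analyze(calls)           # [("billing", 2)]
config = improve(config, patterns)
```

The point of the sketch is the shape, not the implementation: analysis consumes captured outcomes, and its output feeds directly into agent changes.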

The flywheel is self-reinforcing because fixing failure patterns doesn't just improve quality. It clears the noise from your data, making the next round of analysis sharper.

Why Production Data Is Different From Training Data

Training data is a carefully curated collection of examples. It's controlled, cleaned, and representative, by your definition of representative. That's also its fundamental limitation: it can only contain what you thought to include.

Production data doesn't care what you thought. It captures how real customers actually talk, which turns out to be consistently messier, more varied, and more creative than any training set predicts.

Here's what production data reveals that training data can't:

Failure modes at the long tail. Your training set covers the common intents. Production exposes the edge cases: the phrasings you didn't anticipate, the requests that span two intents, the conversations that start normally and take unexpected turns. These are exactly the cases where agents break down.

Real accents and speech patterns. In voice specifically, your training data might represent a relatively narrow slice of how people actually speak. Production calls surface the accents, speech disfluencies, and pacing patterns that cause transcription errors and downstream failures.

Emotional dynamics. Customers who call in frustrated or anxious interact with AI differently than customers in a neutral test session. How your agent handles escalating frustration (whether it de-escalates or makes things worse) only shows up at scale in production.

Genuine customer intent. There's often a gap between what a customer says and what they actually need. In production, you can see this gap by looking at resolution outcomes. A customer who "successfully" updated their address but called back two days later with the same problem? That's a failure your standard success metric missed.

None of this means training data is useless. It means training data gets you to day one. Production data is what makes you better on day 365.

The Anatomy of a Feedback Loop

Let's get concrete about what "closing the loop" actually requires. You need four things working together.

Call Capture That Goes Beyond Transcription

Transcripts are table stakes. What turns a transcript into useful signal is structured metadata around it.

You want to know: did the call resolve? Did the customer escalate to a human? How long did the call take? What was the primary intent? What tools did the agent invoke and which of those tool calls succeeded? If there's a post-call survey, what did the customer say?

Without this metadata, you have a record. With it, you have a labeled example. The difference matters enormously when you're trying to understand failure patterns, because you can filter to "calls where the customer escalated" and immediately see what those calls have in common.
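The difference between a record and a labeled example is easy to see in code. A minimal sketch, with field names that are assumptions rather than a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    """A transcript plus the structured metadata that makes it a labeled example."""
    transcript: str
    intent: str
    resolved: bool
    escalated: bool
    duration_s: float
    tool_calls: list = field(default_factory=list)  # (tool_name, succeeded) pairs

calls = [
    CallRecord("...", "billing", False, True, 312.0, [("lookup_invoice", False)]),
    CallRecord("...", "address_change", True, False, 95.0, [("update_address", True)]),
]

# With metadata attached, filtering to "calls where the customer escalated"
# is one expression -- and the shared failed tool call jumps out.
escalated = [c for c in calls if c.escalated]
failed_tools = [t for c in escalated for t, ok in c.tool_calls if not ok]
```

Without the metadata, answering the same question means re-reading transcripts one by one.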

Chanl's analytics captures this structured metadata on every interaction, not just the transcript, but the full execution trace including tool calls, memory lookups, and outcome signals. That's the raw material for everything that follows.

A Systematic Way to Surface What's Failing

Raw call data is noisy. You can't manually review 500 calls a day. You need a layer that aggregates the noise into patterns.

The most useful patterns are usually:

  • High-frequency failure intents: the same type of request failing repeatedly
  • Escalation clusters: calls that share characteristics and all end in human handoff
  • Tool failure chains: sequences where a specific tool call triggers downstream errors
  • Satisfaction outliers: calls with very low scores that don't fit the obvious failure patterns

The tooling for this ranges from simple (a SQL query on your call logs filtered by escalation flag) to sophisticated (AI-powered clustering of failure transcripts). Where you start matters less than that you start.
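The simple end of that range really can be a single query. A sketch using SQLite (the table layout and column names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (intent TEXT, escalated INTEGER, resolved INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("billing", 1, 0), ("billing", 1, 0), ("returns", 0, 1),
     ("billing", 0, 1), ("returns", 1, 0)],
)

# Group escalated calls by intent to surface high-frequency failure clusters.
rows = conn.execute("""
    SELECT intent, COUNT(*) AS n
    FROM calls
    WHERE escalated = 1
    GROUP BY intent
    ORDER BY n DESC
""").fetchall()
# rows -> [("billing", 2), ("returns", 1)]
```

Even this crude grouping turns 500 raw records into a ranked list of where to look first.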

AI scorecards are one of the most effective instruments here. Instead of relying solely on outcome signals (which are lagging and sometimes noisy), you run a structured quality evaluation on each call, grading criteria like empathy, accuracy, procedure adherence, and resolution quality. The scoring gives you a richer signal than a binary pass/fail, and patterns in the scores tell you exactly which dimensions are degrading.
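A sketch of what that richer signal looks like, assuming each call has been graded 1-5 on a few criteria (by an LLM judge or human reviewer; the criteria names are illustrative):

```python
from statistics import mean

# One scorecard per call, graded on the dimensions you care about.
scorecards = [
    {"empathy": 4, "accuracy": 5, "adherence": 3, "resolution": 4},
    {"empathy": 4, "accuracy": 2, "adherence": 3, "resolution": 3},
    {"empathy": 5, "accuracy": 2, "adherence": 4, "resolution": 3},
]

# Averaging per criterion pinpoints which dimension is degrading --
# something a binary pass/fail outcome signal can't tell you.
by_criterion = {
    crit: round(mean(sc[crit] for sc in scorecards), 2)
    for crit in scorecards[0]
}
weakest = min(by_criterion, key=by_criterion.get)  # "accuracy" in this sample
```

Here the aggregate might look acceptable, but the per-criterion view shows accuracy dragging everything down, which is a much more actionable finding.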

A Direct Path From Insight to Agent Change

Here's where a lot of teams stall. They've got the analysis. They can see the failures. But the path from "we know what's wrong" to "we fixed it in the agent" is tangled.

Maybe prompt changes require a pull request and a deploy cycle. Maybe knowledge base updates live in a different system. Maybe testing the fix requires spinning up a manual call test. The friction in this path is the enemy of the flywheel.

The faster you can go from insight to deployed fix, the tighter the loop. That means having:

  • Direct access to edit prompts without a full software release
  • A way to test changes against realistic scenarios before deploying them live
  • Version control on agent configuration so you can correlate changes with outcome shifts
  • Monitoring that alerts you if a fix made things worse
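Version control on agent configuration can start very small. A hedged sketch of an append-only version log (structure and names are assumptions, not a real platform's API):

```python
import time

# Append-only log of deployed agent configs.
config_versions = []

def deploy(config, note):
    """Record each deployed config with a timestamp and change note."""
    version = {"ts": time.time(), "config": dict(config), "note": note}
    config_versions.append(version)
    return version

def config_at(ts):
    """Which config was live at a given time? Needed to attribute outcomes."""
    live = [v for v in config_versions if v["ts"] <= ts]
    return live[-1] if live else None

deploy({"prompt_rev": 1}, "baseline")
deploy({"prompt_rev": 2}, "fix billing escalations")
```

The payoff is `config_at`: when an outcome metric shifts, you can answer "which version of the agent produced these calls?" instead of guessing.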

This is why agent infrastructure (not just the LLM) matters so much. The model is one component. The operational layer around it determines whether you can actually iterate.

Monitoring That Closes the Loop

After you deploy a fix, you need to know if it worked. This sounds obvious, but it's genuinely hard to do in production AI.

The challenge is signal lag. A prompt change might improve resolution rate, but you won't see that clearly in aggregate metrics for days because resolution rate is noisy. What you want is a way to track whether the specific failure pattern you targeted actually decreased.

That requires tagging your changes and correlating them with outcome metrics for the affected call type. Manual for small teams, automated for larger ones. But the principle is the same: your monitoring needs to be granular enough to tell you whether a specific intervention worked, not just whether the overall numbers moved.
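The manual version of this correlation fits in one function. A sketch, assuming calls carry a timestamp, intent, and resolution flag (illustrative field names):

```python
def pattern_rate(calls, intent, since, until):
    """Failure rate for one intent within a time window."""
    window = [c for c in calls if c["intent"] == intent and since <= c["ts"] < until]
    if not window:
        return None
    return sum(1 for c in window if not c["resolved"]) / len(window)

# Calls around a fix deployed at ts=100, targeting the "billing" intent.
calls = [
    {"ts": 50,  "intent": "billing", "resolved": False},
    {"ts": 60,  "intent": "billing", "resolved": False},
    {"ts": 70,  "intent": "billing", "resolved": True},
    {"ts": 110, "intent": "billing", "resolved": True},
    {"ts": 120, "intent": "billing", "resolved": True},
    {"ts": 130, "intent": "billing", "resolved": False},
]
deploy_ts = 100
before = pattern_rate(calls, "billing", 0, deploy_ts)     # 2/3 failing
after = pattern_rate(calls, "billing", deploy_ts, 10**9)  # 1/3 failing
```

The aggregate resolution rate across all intents might barely move in the same window; the per-pattern slice is what tells you the intervention worked.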

What Gets Better When the Flywheel Spins

The immediate wins from closing this loop tend to come in a predictable order.

First: eliminating known failure patterns. Your first few analysis cycles will surface the high-frequency failures, the things breaking consistently that you can fix with targeted prompt changes or tool configuration updates. These are quick wins that often produce meaningful accuracy improvements.

Next: improving handling of edge cases. Once the obvious failures are patched, you start seeing the longer tail: the weird requests, the ambiguous intents, the multi-step problems. These require more nuanced interventions: richer context in the prompt, additional knowledge base entries, or memory that persists context across turns.

Eventually: architectural improvements. Some failure patterns are systematic. They indicate that the agent's fundamental approach to a problem type is wrong. These surface later, after you've cleared the noise from simpler failures. They require bigger changes but have bigger impact.

The timing varies by call volume. Higher volume means faster pattern emergence. A team handling 1,000 calls a day will surface meaningful patterns in days. A team handling 100 calls a day might take a few weeks. But the pattern is consistent: more data, more signal, more specific fixes, better outcomes, better data.

The Compounding Effect

Here's the thing about flywheels that doesn't show up in a single-quarter analysis: the value compounds.

When you fix your top failure patterns, two things happen. Your overall quality improves. And the remaining failure patterns become easier to see, because they're no longer buried under the noise of the high-frequency failures you just eliminated.

This means each subsequent analysis cycle is more targeted. You're not wading through the same common failures. You're identifying progressively more subtle issues. And the fixes for those issues tend to have higher precision, because you understand the agent's behavior better, you can write better prompts, configure better tools, build better test cases.

The teams that have been running this loop for 12 months aren't just "better at AI" in some general sense. They have a specific, compounding advantage: they've fixed layer after layer of failure, and each layer they've fixed has revealed the next one to address. Their agents handle edge cases that their competitors' agents can't, because they've seen those edge cases in production and built explicit handling for them.

That's a moat. It's not built from a better model. It's built from a tighter loop, run consistently over time.

A Practical Starting Point

You don't need a sophisticated ML pipeline to start. Here's the minimum viable flywheel:

Week 1: Start capturing structured outcome data on every call. At minimum: did it resolve, did it escalate, what was the primary intent. Log this alongside the transcript.

Week 2: Pull all escalated calls from the past two weeks and read a sample. What patterns do you see? Document them: intent types, failure modes, specific phrasings that broke.

Week 3: Make targeted changes to address the top 2-3 failure patterns you identified. Deploy to a subset of calls if possible.

Week 4: Compare resolution and escalation rates before and after. Did the targeted failures decrease?

Ongoing: Build this into a weekly cadence. One hour of analysis, one or two targeted changes, monitoring of the impact. That's it.
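The week-4 comparison needs nothing fancier than two rate calculations. A minimal sketch under the assumption that each call carries `resolved` and `escalated` flags:

```python
def rates(calls):
    """Aggregate resolution and escalation rates for a batch of calls."""
    n = len(calls)
    return {
        "resolution": sum(c["resolved"] for c in calls) / n,
        "escalation": sum(c["escalated"] for c in calls) / n,
    }

# Illustrative batches from before and after the week-3 changes.
before = [{"resolved": True, "escalated": False}] * 6 + \
         [{"resolved": False, "escalated": True}] * 4
after = [{"resolved": True, "escalated": False}] * 8 + \
        [{"resolved": False, "escalated": True}] * 2

delta = {k: round(rates(after)[k] - rates(before)[k], 2) for k in rates(before)}
# resolution up, escalation down: the targeted fixes landed
```

At real volumes you would also want to check the delta isn't noise (week-over-week variance, or a holdout subset of calls), but this is the whole shape of the check.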

The sophistication can come later: automated scoring, clustering, A/B testing of prompt variants. But the loop itself is what matters. Start closing it with whatever you have.

What This Doesn't Solve

Worth being honest about the limits.

A data flywheel won't help you if your agent is wrong at the architectural level: wrong model, wrong approach, wrong problem framing. Those require rethinking, not iteration.

It won't help you hit the theoretical ceiling of what your current setup can achieve. At some point, the remaining failures require fundamental capability improvements: a better model, a new tool, a different conversation design. The flywheel gets you to that ceiling efficiently; it doesn't break through it.

And it won't help you if you can't act on the insights. If your agent configuration is locked inside a vendor platform with no way to iterate quickly, the analysis is interesting but the loop can't close. The feedback loop requires a tight path from insight to change to deployment.

The Question Worth Asking

There's a useful framing for any AI deployment: if your agent is running the same as it was six months ago, why?

Not as a criticism; sometimes stability is what you want. But in customer-facing AI, customer behavior evolves, product changes, new questions arise. An agent that isn't adapting is an agent that's slowly becoming less relevant.

The data flywheel isn't a technology. It's a discipline. The technology (call capture, scoring, analytics, scenario testing) exists to make that discipline faster and more systematic. But the underlying practice is simple: look at what your agent is doing in production, understand where it's failing, fix it, and check that the fix worked.

Do that consistently, and your agent will be materially different, better, in 90 days. And different again in another 90.

That's the compounding advantage. Not a dramatic breakthrough. Just a loop, closed consistently, over time.

Lucas Dalamarta

Engineering Lead

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

