A VIP customer calls your support line for the fourth time this week. She's been working through a complex billing migration. Three previous agents documented every step: the credits applied, the plan changes confirmed, the escalation path agreed upon. Your AI agent picks up the call, greets her by name, and asks how it can help today.
She says, "I'm following up on the migration we discussed on Monday."
The agent responds with a generic billing FAQ. It doesn't mention the credits. It doesn't reference the escalation. It doesn't know about Monday. The task it was given, "answer the customer's billing question," will show as completed in your dashboard. Green checkmark. Another successful resolution.
Except it wasn't.
Table of contents
- The metric that lies to you
- 13.1% recall, 100% task completion
- How memory fails as complexity grows
- The compounding catastrophe
- Why your evals don't catch this
- Drift: the slow version of forgetting
- Building evaluations that test memory
- External memory as ground truth
- What to do about it
The metric that lies to you
Task completion rate is the most popular metric for evaluating AI agents, and it is also the most misleading one. It tells you whether the agent produced an output. It says nothing about whether the agent used the right information to produce it.
Think about what "task completed" actually means. The agent received an input. It generated a response. The response matched some criteria for success. That's it. Nobody checked what the agent knew when it generated that response. Nobody verified that it retrieved the customer's history, or that it remembered the policy change from last Tuesday, or that it used the correct tier-specific pricing.
The agent completed the task the way a student passes a multiple-choice exam by guessing. The output looks right. The process behind it is broken.
A December 2025 assessment framework for agentic AI systems, published by researchers evaluating agents across four dimensions (LLM reasoning, memory, tools, and environment), found something that should make every production team uncomfortable: agents routinely completed tasks while failing to retrieve the information those tasks required.
Research: "Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems" (arXiv:2512.12791, Dec 2025)
Key finding: Memory retrieval achieved only 13.1% recall in complex scenarios, even when the agents completed the tasks they were assigned. Agents forgot previous role mappings, configuration changes, and context established in earlier turns.
13.1% recall. The agent remembered roughly one out of every eight things it had been told or had stored. And it still completed the task.
13.1% recall, 100% task completion
How does an agent complete a task while remembering almost nothing? The same way people do: it fills in the gaps.
Large language models are extraordinarily good at generating plausible responses. When an agent can't retrieve a specific fact, it doesn't throw an error or say "I don't know." It generates something that sounds right. It confabulates. It takes the general shape of what it knows about billing questions or account migrations and produces a response that reads like it came from someone who did their homework.
The customer might not even notice. The response is fluent, professional, correctly formatted. It uses the right terminology. It might even be partially correct, because the agent's general knowledge about your product domain is decent even when its specific memory of this customer is gone.
But "partially correct" in customer service is a specific kind of failure. Telling a customer their credit is $50 when it's actually $150 isn't a minor error. Suggesting they're on the Standard plan when they migrated to Enterprise last week isn't a rounding error. These are the kinds of mistakes that erode trust in ways that no CSAT survey fully captures, because the customer doesn't always know you got it wrong. They just feel like something is off.
The 13.1% recall finding isn't about agents that crashed or returned errors. It's about agents that looked fine. They completed their tasks. They produced polished responses. They just happened to be working with 13% of the information they should have had.
How memory fails as complexity grows
Memory failures don't stay constant as tasks get harder. They scale, and they scale fast. The same assessment framework tracked failure rates across three levels of task complexity, and the pattern is a straight line headed in the wrong direction.
| Task complexity | Average memory failures per session |
|---|---|
| Simple (single-turn, one fact needed) | 0.67 |
| Moderate (multi-turn, multiple facts) | 2.33 |
| Complex (multi-step, cross-reference required) | 3.67 |
Simple tasks are fine. Ask the agent to look up one thing and respond, and it usually gets it right. The memory retrieval challenge is minimal because there's only one fact to retrieve.
Moderate tasks are where cracks appear. These are conversations where the agent needs to hold multiple facts in play: the customer's plan, their previous interaction, the policy that applies to their tier. Two or three memory retrievals, and failures more than triple.
Complex tasks are where the system breaks down. These require the agent to cross-reference information across turns, remember configuration changes, and apply the right policy based on a combination of stored facts. The failure rate jumps to 3.67 per session, meaning the agent is making nearly four memory errors in a single interaction.
Here's why this matters for production: simple tasks are the ones you test in demos. Complex tasks are the ones your customers actually have. The VIP on her fourth call this week isn't running a simple query. She's deep in a multi-step process that requires the agent to synthesize information from three previous conversations. That's exactly the scenario where memory fails most catastrophically.
And most eval suites don't test beyond moderate complexity. They run single-turn or short multi-turn conversations that stay comfortably in the zone where memory mostly works. Your dashboard says 95% task completion. Your customers are having a very different experience.
The compounding catastrophe
Memory failures are bad enough in isolation. But agents don't just have memory. They have tools. And when memory fails at the same time tools fail, the result is worse than either failure alone.
The same research framework found that tool orchestration had the highest failure rate of any dimension: 7.67 average failures across complex scenarios. Agents skipped diagnostic steps before attempting remediation. They called the wrong tools, or called the right tools with wrong parameters, or skipped tools entirely when they should have used them.
Now combine the two. An agent retrieves stale information about a customer's account (memory failure), then uses that stale information to decide which tool to call (tool orchestration failure). It looks up the wrong plan details and confidently processes a change that doesn't apply.
The dangerous part is the confidence. The agent doesn't hedge. It doesn't say "I'm not sure about your current plan, let me verify." It states incorrect facts with the same fluency it uses for correct ones, and then takes action based on those incorrect facts. This is the failure mode that costs real money: not the agent that crashes, but the agent that confidently does the wrong thing.
Research: "Towards a Science of AI Agent Reliability" (arXiv:2602.16666, Feb 2026)
Key finding: Agents exhibit systematic miscalibration between confidence and actual performance. Consistency and predictability are independent dimensions from accuracy. An agent can be consistently wrong with high confidence.
This miscalibration is what makes memory failures invisible to traditional monitoring. If the agent expressed uncertainty when its memory retrieval failed, you could catch it. But it doesn't. The output looks the same whether the agent retrieved the correct fact or fabricated one. The confidence level doesn't change. The tone doesn't shift. The only way to know is to check what the agent actually retrieved versus what it should have known.
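Checking that means comparing the facts the agent should have known against the facts its response actually reflects. A minimal sketch of that comparison, where the fact shape and the naive substring matching are illustrative assumptions rather than a real eval API:

```typescript
// Compare stored ground-truth facts against what a response reflects.
// The MemoryFact shape and substring matching are illustrative only.
interface MemoryFact {
  key: string;   // e.g. "credit_applied"
  value: string; // e.g. "$150"
}

// Recall = fraction of stored facts whose value appears in the response.
// Crude, but enough to flag a silent retrieval failure in a fluent reply.
function memoryRecall(stored: MemoryFact[], response: string): number {
  if (stored.length === 0) return 1;
  const used = stored.filter(f => response.includes(f.value));
  return used.length / stored.length;
}

const stored: MemoryFact[] = [
  { key: 'plan', value: 'Enterprise' },
  { key: 'credit_applied', value: '$150' },
  { key: 'escalation_contact', value: 'Sarah Chen' },
];

// A fluent, confident response built on fabricated details:
const response = 'Your account shows a $50 credit on the Standard plan.';

console.log(memoryRecall(stored, response)); // 0 — none of the stored facts were used
```

Real scorecards would match semantically rather than by substring, but the shape of the check is the same: score against what was stored, not against what sounds plausible.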
Why your evals don't catch this
Standard evaluation frameworks measure output quality. Did the response answer the question? Was the format correct? Did the agent follow the conversation flow? These are important, but they measure the wrong layer when memory is the failure mode.
Consider a typical eval for a customer support agent. You feed it a conversation transcript. The agent generates a response. A rubric checks whether the response was helpful, whether it addressed the customer's question, whether it was polite and professional. The agent gets a high score because its response was, in fact, helpful and professional. It just happened to use fabricated account details instead of the real ones.
Output-based evals are blind to memory failures for three reasons.
They don't test retrieval. The eval checks what the agent said, not what it looked up. If the agent's response is plausible without retrieving any stored information, the eval passes whether or not retrieval happened.
They don't have ground truth for memory. The eval knows what a good response looks like. It doesn't know what the agent should have remembered. Without a reference dataset of "this agent stored these facts and should retrieve them in this context," there's nothing to compare against.
They test single interactions, not continuity. Most evals run independent test cases. They don't test whether the agent remembered something from case 7 when it ran case 12. But that's exactly what customers experience: a sequence of interactions where memory matters across sessions.
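Continuity can be tested by chaining cases over one shared store: a fact written in an early case must surface in a later one. A hypothetical sketch, where the runner and in-memory store are stand-ins for a real harness, not an existing framework:

```typescript
// Sketch of a continuity eval: case 7 stores a fact, case 12 must use it.
// The Map-based store and runner are illustrative stand-ins.
type Store = Map<string, string>;

interface TestCase {
  id: number;
  run: (store: Store) => string; // returns the agent's response
  expects?: string;              // fact the response must contain
}

function runSuite(cases: TestCase[]): number[] {
  const store: Store = new Map(); // shared across cases, like real sessions
  const failures: number[] = [];
  for (const c of cases) {
    const response = c.run(store);
    if (c.expects && !response.includes(c.expects)) failures.push(c.id);
  }
  return failures;
}

const failures = runSuite([
  { id: 7, run: s => { s.set('credit', '$150'); return 'Credit of $150 applied.'; } },
  // Case 12 simulates an agent that ignores the shared store entirely:
  { id: 12, run: _s => 'Here is our general billing FAQ.', expects: '$150' },
]);

console.log(failures); // [12] — the continuity failure an independent-case suite would miss
```

Run independently, case 12 looks like a reasonable response and passes. Run as a sequence, it fails, which is the behavior your customers actually see.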
Anthropic's evals guide makes this point directly: conversational agents need multi-dimensional success criteria that go beyond task completion. Interaction quality, consistency across turns, and appropriate use of available information are separate dimensions that standard benchmarks collapse into a single score.
Anthropic Engineering: "Demystifying Evals for AI Agents"
Key finding: Evaluation of conversational agents requires multi-dimensional success criteria including interaction quality, not just task-level pass/fail. Single-score metrics hide failure modes that matter in production.
Drift: the slow version of forgetting
Memory failures aren't always sudden. Sometimes the agent gradually shifts away from what it knew, turn by turn, until the original information is effectively gone. This is agent drift.
Semantic drift is the progressive deviation from an agent's original intent over extended interactions. The agent starts the conversation correctly grounded: it knows the customer's plan, their history, the applicable policies. But as the conversation extends, each turn introduces small perturbations. The agent's attention shifts. New information partially overwrites old context. By turn 15, the agent is operating on a subtly different understanding than it had at turn 1.
Research: "Agent Drift" (arXiv:2601.04170, Jan 2026)
Key finding: Semantic drift causes progressive deviation from original intent in extended agent interactions. Episodic memory consolidation, where recent interactions are periodically distilled into stable knowledge, is proposed as a mitigation.
This is harder to test for than sudden forgetting because the degradation is gradual. The agent doesn't go from "correct" to "wrong" in one step. It goes from "correct" to "slightly off" to "mostly off" to "confidently wrong" over the course of a long interaction. Each individual turn looks reasonable. The trajectory is the problem.
Drift is particularly insidious in agents that handle long customer journeys. A customer working through a multi-week onboarding process, a complex return, an escalation that spans multiple sessions. These are the interactions where drift accumulates, and they're also the interactions where getting it right matters most.
The proposed mitigation, episodic memory consolidation, works by periodically extracting stable facts from recent interactions and storing them externally. Instead of relying on the model's context window to maintain accuracy over 50 turns, you extract what's true after every few turns and write it to a persistent store. The agent can then retrieve ground truth rather than depending on its increasingly fuzzy internal state.
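The consolidation loop can be sketched as follows; the regex-based fact extractor here is a trivial stand-in for the LLM-based distillation step the research proposes, and the array store stands in for a persistent external one:

```typescript
// Sketch of episodic memory consolidation: every N turns, distill the
// recent window into stable facts and write them outside the context.
interface Turn { role: 'user' | 'agent'; text: string; }

const CONSOLIDATE_EVERY = 3;          // distill after every 3 turns (tunable)
const persistentStore: string[] = []; // stand-in for an external memory store

// Trivial extractor: keep lines that look like stated account facts.
// In practice this would be an LLM distillation pass.
function extractFacts(turns: Turn[]): string[] {
  return turns.map(t => t.text).filter(text => /plan|credit|contact/i.test(text));
}

function onTurn(history: Turn[], turn: Turn): void {
  history.push(turn);
  if (history.length % CONSOLIDATE_EVERY === 0) {
    const window = history.slice(-CONSOLIDATE_EVERY);
    persistentStore.push(...extractFacts(window)); // survives context-window loss
  }
}

const history: Turn[] = [];
onTurn(history, { role: 'user', text: 'I migrated to the Enterprise plan.' });
onTurn(history, { role: 'agent', text: 'Confirmed.' });
onTurn(history, { role: 'user', text: 'A $150 credit was applied.' });

console.log(persistentStore); // the two stated facts, now outside the context window
```

The point of the pattern is the write-through: by turn 50, the agent retrieves "A $150 credit was applied" from the store at full fidelity instead of hoping it survived 47 turns of attention decay.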
Building evaluations that test memory
If output-based evals miss memory failures, what do you actually need? Evaluations that test what the agent knows, not just what it says.
This requires a different architecture for your eval pipeline. You need ground truth for what the agent should remember, a way to probe whether it actually retrieves that information, and scoring criteria that penalize correct-sounding responses built on incorrect data.
Here's what that looks like in practice. You build scenarios that require the agent to reference specific stored facts across multiple turns. Not "answer the billing question" but "answer the billing question using the credit that was applied on March 3rd." The eval doesn't just check whether the response mentions billing. It checks whether the specific credit amount appears, whether the date is correct, whether the agent referenced the previous interaction where the credit was agreed upon.
```typescript
import { Chanl } from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Store ground truth facts before running the scenario
await chanl.memory.create({
  entityType: 'customer',
  entityId: 'vip-customer-42',
  agentId: 'support-agent-01',
  content: 'Customer is on Enterprise plan after migration on March 3rd',
  key: 'plan',
  value: 'Enterprise',
});

await chanl.memory.create({
  entityType: 'customer',
  entityId: 'vip-customer-42',
  agentId: 'support-agent-01',
  content: '$150 credit applied during billing migration',
  key: 'credit_applied',
  value: '$150',
});

await chanl.memory.create({
  entityType: 'customer',
  entityId: 'vip-customer-42',
  agentId: 'support-agent-01',
  content: 'Escalation contact is Sarah Chen, Account Manager',
  key: 'escalation_contact',
  value: 'Sarah Chen, Account Manager',
});

// Run a scenario that requires these facts
const { data: execution } = await chanl.scenarios.run('vip-followup-billing', {
  agentId: 'support-agent-01',
  parameters: {
    customerName: 'Maria Torres',
    contactId: 'vip-customer-42',
    openingMessage: "Hi, I'm following up on the billing migration we discussed on Monday."
  }
});

// Evaluate with a memory-retention scorecard.
// The scorecard's criteria test specific stored facts:
// - Did the agent mention the March 3rd migration call? (weight: 30%)
// - Did the agent reference the $150 credit? (weight: 30%)
// - Did the agent use Enterprise-tier policies? (weight: 20%)
// - Did the agent mention Sarah Chen as escalation contact? (weight: 20%)
const { data: result } = await chanl.scorecard.evaluate(
  execution.execution.callDetails?.callId!,
  { scorecardId: 'memory-retention-scorecard' }
);

console.log(`Memory retention evaluation: ${result.status}`);

// A standard eval would score this 90%+ on "helpfulness."
// A memory-aware eval reveals whether the agent actually used what it knew.
```

The key difference: every criterion maps to a specific stored fact. The scorecard doesn't ask "was the response good?" It asks "did the agent use fact X from memory store Y?" That's how you catch the 86.9% gap between what the agent stored and what it retrieved.
You can also verify retrieval independently of the conversation. After the scenario runs, query the same memory store and compare what's there to what the agent actually used in its response:

```typescript
// After the scenario runs, verify what the agent could have retrieved
const { data: searchResults } = await chanl.memory.search({
  entityType: 'customer',
  entityId: 'vip-customer-42',
  agentId: 'support-agent-01',
  query: 'billing migration credit',
  limit: 10
});

console.log('Stored memories:', searchResults.memories.length);
console.log('Memories the agent should have used:');
searchResults.memories.forEach(m => {
  console.log(`  - ${m.content} (score: ${m.score})`);
});

// Compare against what appeared in the agent's actual response.
// If the memory store has the facts but the response doesn't reflect them,
// you've found a retrieval failure, not a storage failure.
```

This distinction between storage failures and retrieval failures matters for debugging. If the facts aren't in the memory store, your ingestion pipeline is broken. If the facts are in the store but the agent didn't use them, your retrieval mechanism is broken. The fix is different for each case, and output-only evals can't tell you which one you have.
External memory as ground truth
The root cause of the 13.1% recall problem is that most agents rely on in-context memory. Everything the agent "knows" lives in the context window: conversation history, retrieved documents, stored facts, all competing for the same finite token space.
As the context fills up, older information gets pushed out or compressed. The model's attention degrades. Facts from turn 3 are less accessible by turn 30. This isn't a bug. It's how transformer attention works. The model attends to everything in its context, but attention quality degrades as the window fills. Items in the middle of a long context get the least attention, a phenomenon researchers call the "lost in the middle" effect.
External persistent memory solves this by moving ground truth outside the context window entirely. Facts about the customer, their history, their preferences, their tier, the outcomes of previous interactions: these live in a searchable store that doesn't degrade with conversation length. The agent queries this store when it needs information, and the retrieval quality is the same whether it's turn 2 or turn 200.
This is the architectural difference between memory that works in demos and memory that works in production. In a demo, the conversation is short enough that everything fits in context. In production, the customer is on their fourth call, there are 200 stored facts about their account, and the agent needs to retrieve the right three for this specific question. That's a retrieval problem, not a generation problem, and it requires infrastructure outside the model.
The practical effect: instead of hoping the model remembers what it was told 40 turns ago, you have a system of record. The agent queries it. The facts it gets back are the same facts that were stored, not compressed summaries or attention-degraded approximations. When a scorecard asks "did the agent use the correct credit amount," you can verify it against the memory store. When monitoring flags a conversation where the agent contradicted itself, you can trace whether the contradiction came from memory retrieval or from the model's generation.
This doesn't make memory failures impossible. The agent still has to query the right things, and retrieval relevance is its own challenge. But it moves the failure mode from "the model forgot" (undetectable, unfixable) to "the retrieval didn't return the right results" (detectable, debuggable, fixable).
What to do about it
The 13.1% recall finding is uncomfortable, but it points directly at what to fix. Memory failures are detectable, measurable, and addressable. You just need to stop measuring the wrong thing.
Stop treating task completion as a proxy for quality. A completed task with fabricated data is worse than an incomplete task that asks for help. Add memory-specific dimensions to your scorecards: did the agent retrieve stored facts? Did it use the correct context-specific information? Did it reference previous interactions when relevant?
Test at the complexity level your customers actually experience. If your evals only run simple, single-turn interactions, you're testing in the zone where memory mostly works. Build multi-turn scenarios that require cross-referencing facts from previous sessions. That's where the 0.67-to-3.67 failure scaling kicks in, and that's where you'll find the real failure rate.
Separate storage failures from retrieval failures. When a memory-aware eval fails, diagnose which layer broke. Is the fact in the memory store? Then your storage is fine and your retrieval needs work. Is the fact missing from the store entirely? Then your ingestion pipeline dropped it. Different problems, different fixes.
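That triage reduces to a two-step check. A sketch with illustrative names, where `stored` is the memory store's contents and `retrieved` is what the retrieval step returned:

```typescript
// Triage sketch: classify a memory-eval failure as storage vs retrieval.
// The Set-of-keys representation is a simplification for illustration.
type Diagnosis = 'ok' | 'storage_failure' | 'retrieval_failure';

function diagnose(
  expectedKey: string,
  stored: Set<string>,    // keys present in the memory store
  retrieved: Set<string>, // keys the retrieval step actually surfaced
): Diagnosis {
  if (!stored.has(expectedKey)) return 'storage_failure';      // ingestion dropped it
  if (!retrieved.has(expectedKey)) return 'retrieval_failure'; // stored, never surfaced
  return 'ok';
}

const stored = new Set(['plan', 'credit_applied', 'escalation_contact']);
const retrieved = new Set(['plan']); // retrieval surfaced one of three facts

console.log(diagnose('credit_applied', stored, retrieved)); // "retrieval_failure"
console.log(diagnose('refund_policy', stored, retrieved));  // "storage_failure"
```

The first result sends you to your retrieval mechanism; the second sends you to your ingestion pipeline. Without the intermediate `retrieved` log, both look identical in the final response.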
Instrument retrieval, not just responses. Log what the agent retrieved from memory on every turn, not just what it said. When you can compare "facts available" to "facts used," you can measure the actual recall rate for your specific agent on your specific customer interactions. You might find it's better than 13.1%. You might find it's worse. Either way, you'll know.
Use external memory for facts that matter. Customer tier, account history, previous resolutions, escalation contacts, preferences: these facts need to survive context window compression. Moving them to a persistent store outside the model means they're available at the same fidelity on turn 1 and turn 100.
The agents that earn trust won't be the ones with the highest task completion rates. They'll be the ones that remember what the customer told them yesterday. Measuring that requires evaluations that look beyond output quality, into what the agent actually knew when it generated its response.
That VIP customer on her fourth call this week? She doesn't care about your completion metrics. She cares whether the agent knows her name, her plan, her credit, and the escalation path she was promised. The research says there's an 87% chance it doesn't. Fixing that starts with measuring it.
Test what your agent remembers, not just what it says
Build scenarios that require memory across turns. Score whether your agent actually uses what it knows. Track retrieval accuracy alongside task completion.