Your QA team listens to 200 calls a day. Your call center handles 10,000. That is a 2% sample rate on a good day. On the human side, 2% is standard practice. QA has always been sampling-based because listening to every call is physically impossible.
Then you deployed AI agents for tier-1 inquiries. Policy lookups, claim status, billing questions. The AI handles 40% of volume now, about 4,000 calls per day. And your QA process for those 4,000 conversations? "We check the ones customers complain about."
That is not a quality strategy. That is incident response disguised as quality assurance.
The gap between deploying AI agents and monitoring AI agents is where enterprise reputations break. One bad prompt change can degrade 4,000 conversations before anyone notices. A knowledge base update with incorrect procedure spreads to every call instantly. The AI does not get tired, does not have a bad day, and does not phone it in on a Friday afternoon. But it also does not raise its hand when something feels off. It will confidently give wrong information to 500 customers in the time it takes your QA team to finish their morning coffee.
| Metric | Human Agents | AI Agents (Unmonitored) | AI Agents (Scorecard-Graded) |
|---|---|---|---|
| QA coverage | 2-5% sampled | Complaints only | 100% automated |
| Time to detect regression | 1-2 weeks | When customers escalate | Hours |
| Cost per interaction | $3-7 | $0.50-1.00 | $0.50-1.00 + $0.02 scoring |
| Consistency of evaluation | Varies by reviewer | None | Identical rubric, every call |
Table of contents
- The monitoring problem changes at scale
- Different calls need different scorecards
- Building the quality infrastructure
- Catching regression before customers do
- Memory across channels and callbacks
- Human handoff that preserves context
- Compliance and the audit trail
- The architecture of a monitored call center
- What to build first
The monitoring problem changes at scale
Ten AI calls a day, you can read the transcripts. A hundred, you can skim. A thousand, you start sampling. At 4,000, sampling is a bet that the problems are evenly distributed. They are not.
Quality failures in AI agents cluster. A prompt change that slightly weakens the agent's ability to handle billing disputes will not affect policy lookup calls at all. If your 2% sample happens to land on policy lookups, you will miss the billing regression entirely. Sampling assumes uniform failure distribution. AI agent failures are domain-specific, prompt-specific, and knowledge-specific. This is the silent degradation pattern that makes AI QA fundamentally different from human QA.
One insurance company discovered this the hard way. They updated their knowledge base with new claims procedures. The update was correct for auto claims but contained an error in the homeowner's section. The AI agent started giving incorrect filing deadlines to homeowner's claim callers. Because those calls represented only 15% of total volume, and QA was sampling randomly, the error ran for eleven days before a customer's attorney sent a formal complaint.
Eleven days. Roughly 660 homeowner's claim calls with incorrect filing deadline information. Each one a potential regulatory issue.
The fix is not sampling harder. The fix is grading every call, automatically, against specific quality criteria that match the call type. That is what scorecards are built for.
Different calls need different scorecards
A billing inquiry and a claims filing have almost nothing in common from a quality perspective. "Did the agent give the correct balance?" is a binary question. "Did the agent collect all required documentation for a water damage claim?" is a checklist. "Did the agent acknowledge the customer's frustration before jumping to solutions?" is a judgment call.
Applying a single generic scorecard to all three produces noise. The billing call scores low on "empathy" because the customer asked a straightforward question and the agent gave a straightforward answer. That is not a quality failure. The complaint call scores high on "accuracy" because the agent stated the refund policy correctly. But the agent recited that policy while the customer was still expressing frustration about being overcharged for three months. Accuracy without timing is not quality.
Enterprise call centers need scorecard families. Each call type gets its own rubric:
| Call Type | Key Criteria | What "Good" Looks Like |
|---|---|---|
| Billing inquiry | Balance accuracy, payment options completeness, promo eligibility | Correct balance, all payment methods offered, applicable discounts surfaced |
| Claims filing | Required field collection, deadline accuracy, document guidance | All fields captured, correct deadlines, clear next-steps for documentation |
| Complaint | Emotional acknowledgment, de-escalation before problem-solving, resolution offer | Frustration validated first, then solution, followed by confirmation |
| Policy change | Eligibility verification, coverage explanation, premium impact disclosure | Eligibility confirmed, coverage deltas explained, cost clearly stated |
| Spanish-language | Same criteria as English equivalent + language fluency, cultural appropriateness | Natural language, culturally appropriate formality level |
Here is how you set up a billing inquiry scorecard with criteria that match what "correct" means for that specific call type:
import { Chanl } from '@chanl/sdk'
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY })
// Create the billing inquiry scorecard
const { data: billingScorecard } = await chanl.scorecard.create({
name: 'Billing Inquiry Quality',
description: 'Evaluates accuracy, completeness, and compliance for billing calls',
scoringAlgorithm: 'weighted_average',
passingThreshold: 75,
industry: 'insurance',
useCase: 'support',
})
// Criterion 1: Did the agent state the correct balance?
await chanl.scorecard.createCriterion(billingScorecard.id, {
name: 'Balance Accuracy',
key: 'balance-accuracy',
type: 'prompt',
settings: {
description: 'Agent provided the correct current balance from the billing system, ' +
'not an estimate or outdated figure',
evaluationType: 'boolean',
},
weight: 35,
})
// Criterion 2: Did the agent offer all available payment options?
await chanl.scorecard.createCriterion(billingScorecard.id, {
name: 'Payment Options Completeness',
key: 'payment-options',
type: 'prompt',
settings: {
description: 'Agent presented all available payment methods including online portal, ' +
'phone payment, autopay enrollment, and mail-in options',
evaluationType: 'score',
},
weight: 25,
})
// Criterion 3: Did the agent check for applicable discounts?
await chanl.scorecard.createCriterion(billingScorecard.id, {
name: 'Discount Eligibility Check',
key: 'discount-check',
type: 'prompt',
settings: {
description: 'Agent proactively checked for multi-policy, loyalty, or promotional ' +
'discounts applicable to the customer account',
evaluationType: 'boolean',
},
weight: 20,
})
// Criterion 4: Compliance - did the agent follow disclosure requirements?
await chanl.scorecard.createCriterion(billingScorecard.id, {
name: 'Regulatory Disclosure',
key: 'regulatory-disclosure',
type: 'prompt',
settings: {
description: 'Agent included required state-mandated payment disclosures and ' +
'did not make unauthorized promises about billing adjustments',
evaluationType: 'boolean',
},
weight: 20,
})

The complaint scorecard looks completely different. The weight shifts from accuracy to emotional intelligence:
const { data: complaintScorecard } = await chanl.scorecard.create({
name: 'Complaint Handling Quality',
description: 'Evaluates de-escalation, empathy, and resolution for complaint calls',
scoringAlgorithm: 'weighted_average',
passingThreshold: 70,
industry: 'insurance',
useCase: 'support',
})
await chanl.scorecard.createCriterion(complaintScorecard.id, {
name: 'Emotional Acknowledgment',
key: 'emotional-ack',
type: 'prompt',
settings: {
description: 'Agent recognized and validated the customer emotional state ' +
'BEFORE attempting to solve the problem. Did not jump to solutions ' +
'while the customer was still expressing frustration.',
evaluationType: 'score',
},
weight: 35,
})
await chanl.scorecard.createCriterion(complaintScorecard.id, {
name: 'De-escalation Sequence',
key: 'de-escalation',
type: 'prompt',
settings: {
description: 'Agent followed acknowledge-apologize-act sequence. ' +
'Did not use dismissive language like "I understand" without specifics.',
evaluationType: 'score',
},
weight: 30,
})
await chanl.scorecard.createCriterion(complaintScorecard.id, {
name: 'Resolution Offered',
key: 'resolution',
type: 'prompt',
settings: {
description: 'Agent offered a concrete resolution with clear next steps, ' +
'not a vague promise to "look into it"',
evaluationType: 'boolean',
},
weight: 20,
})
await chanl.scorecard.createCriterion(complaintScorecard.id, {
name: 'Escalation Judgment',
key: 'escalation-judgment',
type: 'prompt',
settings: {
description: 'If the customer remained dissatisfied after resolution attempt, ' +
'agent offered human escalation without requiring the customer to demand it',
evaluationType: 'boolean',
},
weight: 15,
})

Notice the weight distribution. Billing: 35% on accuracy, 25% on completeness. Complaints: 35% on emotional acknowledgment, 30% on de-escalation. The definition of "quality" is different because the customer's needs are different. A one-size-fits-all scorecard would penalize accurate billing agents for not being empathetic enough and empathetic complaint handlers for not being comprehensive enough.
Building the quality infrastructure
Scorecards define what "good" looks like. The knowledge base, tools, and agent configuration determine whether the agent can actually deliver it. At enterprise scale, this infrastructure has to support hundreds of documents, dozens of tools, and multiple agent configurations running simultaneously.
The knowledge layer is the foundation. An insurance call center's AI agent needs access to policy documents, billing procedures, claims processes, state-specific regulations, and product descriptions. That is hundreds of documents, not a handful.
// Upload the claims procedures knowledge base
const { data: claimsKb } = await chanl.knowledge.create({
title: 'Claims Procedures - 2026',
source: 'text',
content: claimsProceduresText, // loaded from your document store
})
// Upload state-specific regulatory requirements
await chanl.knowledge.create({
title: 'State Regulatory Requirements - California',
source: 'text',
content: caDoiRequirementsText,
})
// Verify the knowledge base answers correctly
const { data: searchResult } = await chanl.knowledge.search({
query: 'What is the filing deadline for a homeowner water damage claim in California?',
limit: 3,
})
console.log('Top result:', searchResult?.results[0]?.content)
// Should return: "California homeowner claims must be filed within 1 year..."

The tool layer connects the agent to live systems. A billing agent that cannot look up the actual account balance is useless regardless of how good its knowledge base is.
The MCP connection is what makes this work without ripping out existing infrastructure. Your telephony platform (Genesys, Five9, Amazon Connect) handles the call routing and voice pipeline. Chanl provides the intelligence layer: knowledge retrieval, tool execution, memory, and quality evaluation. The telephony platform does not need to know how the agent decides what to say. It just needs to send the conversation and receive the response.
Catching regression before customers do
The most dangerous moment in a call center AI deployment is not the launch. It is the Tuesday three weeks later when someone updates the knowledge base. Or changes a prompt. Or the LLM provider ships a model update.
At 4,000 calls per day, a regression that drops quality by 15% means 600 degraded conversations per day. If it takes a week to notice, that is 4,200 affected customers. In a regulated industry, some of those conversations may have compliance implications.
Scenario testing is how you catch regressions before they hit production. Define test personas that simulate each call type. Run them before and after every change. Compare scorecard distributions.
// Define scenario personas for each call type
const billingPersona = {
name: 'Billing Inquiry - Standard',
backstory: 'Policyholder checking current balance and due date. ' +
'Has two active policies (auto and home). Wants to know if autopay ' +
'discount is available.',
emotion: 'neutral',
intentClarity: 'direct',
speechStyle: 'casual',
}
const angryComplainer = {
name: 'Complaint - Overcharged Three Months',
backstory: 'Discovered they have been overcharged $47/month for three months ' +
'after a policy change they did not authorize. Frustrated, wants a refund ' +
'and an explanation. Will escalate if not satisfied.',
emotion: 'angry',
intentClarity: 'direct',
speechStyle: 'assertive',
}
const confusedSenior = {
name: 'Claims Filing - Confused Senior',
backstory: '78-year-old filing a homeowner claim for the first time after ' +
'a kitchen pipe burst. Does not understand insurance jargon. Needs patient, ' +
'step-by-step guidance. Will ask the same question multiple ways.',
emotion: 'anxious',
intentClarity: 'vague',
speechStyle: 'casual',
}

After updating the knowledge base, run all scenarios and compare:
// Run all scenarios against the agent after a KB update
const results = await chanl.scenarios.runAll({
agentId: insuranceAgentId,
minScore: 70,
})
console.log('Scenarios completed:', results.totalScenarios)
console.log('Average score:', results.averageScore)
console.log('Passed:', `${results.passed}/${results.totalScenarios}`)
// Compare with the previous baseline
// If average score drops more than 5 points, block the deployment
if (results.averageScore < baselineScore - 5) {
console.error(`Quality regression detected: ${results.averageScore} vs baseline ${baselineScore}`)
console.error('Blocking KB update deployment.')
process.exit(1)
}

Run the full scenario suite on every knowledge base update, every prompt change, and every model version bump. The suite takes minutes to run. The alternative is discovering the regression through customer complaints, which takes days or weeks and comes with regulatory risk.
This is the testing equivalent of CI/CD for AI quality. You would not deploy code without running tests. Do not deploy prompt or knowledge changes without running scenarios.
More on regression testing strategies: Scenario Testing: The QA Strategy That Catches What Unit Tests Miss.
Memory across channels and callbacks
A customer calls Monday about a billing discrepancy. The AI agent explains the charge, the customer says they will check their records and call back. Wednesday, they call again. Without memory, the conversation starts from zero. "Can you give me your policy number? What was the billing issue you're calling about?"
That is the experience customers hate most. They already explained the problem. They expect you to remember.
Persistent memory solves this by storing customer interaction context across calls and channels. When the Wednesday call begins, the agent retrieves Monday's context before its first response:
// At the start of every call, search for prior interactions
const { data: memories } = await chanl.memory.search({
entityType: 'customer',
entityId: customerId,
query: 'recent billing inquiry',
})
if (memories.memories.length > 0) {
// Inject context into the agent's system prompt
const priorContext = memories.memories
.map(m => m.content)
.join('. ')
console.log('Prior context loaded:', priorContext)
// "Customer called 2 days ago about a $47/month overcharge on policy #HO-4521.
// Agent explained the charge was from a coverage upgrade. Customer said they
// did not authorize the upgrade and would check their records."
}

Memory also learns from conversations automatically. After each call ends, the extraction pipeline identifies facts worth remembering:
// After a call ends, extract and persist facts
const { data: extracted } = await chanl.memory.extract({
text: callTranscript,
entityType: 'customer',
entityId: customerId,
save: true, // Persist immediately
})
console.log('Facts extracted:', extracted.facts.length)
// [
// { content: "Disputed $47/month charge on policy HO-4521", confidence: 0.94 },
// { content: "Claims they did not authorize coverage upgrade on Jan 15", confidence: 0.91 },
// { content: "Prefers to be called Patricia, not Pat", confidence: 0.87 }
// ]

The third fact is the one that separates functional AI from good AI. The agent did not just capture the billing dispute. It captured a personal preference. Next time Patricia calls, the agent greets her correctly. That is the kind of detail human agents remember about their regulars. AI agents can remember it about every customer, at scale.
The memory lifecycle for a call center interaction:
- Call starts. Agent queries memory for this customer's prior interactions.
- During the call. Critical facts are stored in real-time (explicit statements, preferences, unresolved issues).
- Call ends. Extraction pipeline processes the full transcript, identifies new facts, and persists them with confidence scores.
- Next call. The cycle repeats, but now with richer context.
Over time, the agent builds a comprehensive customer profile without anyone manually entering data. The customer who has called three times about a claim gets an agent that already knows the claim number, the adjuster's name, and which documents are still outstanding. For a deeper look at the engineering behind persistent memory: Build Your Own AI Agent Memory System.
Human handoff that preserves context
AI agents should not handle everything. The question is not whether to escalate but when and how.
"When" has three categories:
Emotional threshold. The customer's frustration has exceeded what the AI can de-escalate. This is not just detecting the word "angry." It is tracking escalation signals across the conversation: repeated objections, raised voice indicators in speech-to-text, explicit requests to speak to a manager, and language patterns that indicate the customer has lost confidence in the AI.
Complexity threshold. The agent has attempted resolution twice without converging on a solution. A billing question that should take two turns but has reached eight turns is not going to resolve. The agent's knowledge or tools are insufficient for this specific case.
Compliance threshold. The conversation involves a formal dispute, a legal threat, a fraud indicator, or a request for information the AI is not authorized to provide. These conversations need a human decision-maker, not because the AI cannot technically handle them, but because the regulatory framework requires human judgment.
"How" matters as much as "when." A bad handoff loses everything the customer already said. They repeat their story, their frustration compounds, and the human agent starts from behind.
The agent's escalation tool should bundle the full conversation context:
// Escalation tool schema (connected via MCP)
const escalationTool = {
name: 'escalate_to_human',
description: 'Transfer the call to a human agent with full context',
parameters: {
reason: 'string', // Why the AI is escalating
category: 'string', // billing_dispute | claims_complex | compliance | customer_request
customerSentiment: 'string', // frustrated | angry | confused | neutral
summaryForAgent: 'string', // 2-3 sentence summary for the human agent
attemptedResolutions: 'string[]', // What the AI already tried
unresolvedIssues: 'string[]', // What remains open
},
}

When the human agent picks up, they see: "Customer Patricia has called twice about a $47/month overcharge on policy HO-4521. She says she did not authorize a coverage upgrade on January 15. AI offered a billing review and a credit, both rejected. Customer is requesting a full three-month refund and wants to speak with a manager about the unauthorized change."
The human agent does not ask "How can I help you today?" They say "Hi Patricia, I see you've been dealing with an overcharge on your homeowner's policy. Let me pull up the authorization records for that January upgrade."
That is the difference between handoff and context-preserving handoff. One generates a complaint. The other generates a resolution.
Compliance and the audit trail
In insurance, healthcare, and financial services, "we think the AI did a good job" is not sufficient for regulators. They want evidence. Every AI conversation needs a verifiable audit trail: what was said, how it was scored, and whether compliance criteria were met.
The scorecard system produces this audit trail automatically. Every call is transcribed, evaluated against the relevant scorecard, and stored with complete metadata:
- Conversation transcript with timestamps
- Scorecard results with per-criterion scores and AI evaluator reasoning
- Agent configuration snapshot (prompt version, model version, knowledge base version)
- Tool call log (what backend systems were queried, what data was returned)
- Escalation record (if applicable: when, why, and the context package sent to the human agent)
// Evaluate a call against the compliance-specific scorecard
const { data: evaluation } = await chanl.scorecard.evaluate(callId, {
scorecardId: complianceScorecard.id,
})
// Retrieve the detailed results
const { data: results } = await chanl.scorecard.getResultsByCall(callId)
const result = results.results[0]
console.log('Compliance score:', result.overallScore)
for (const cr of result.criteriaResults) {
console.log(` ${cr.criteriaKey}: ${cr.result} (${cr.passed ? 'PASS' : 'FAIL'})`)
console.log(` Reasoning: ${cr.reasoning}`)
// "The agent correctly disclosed the 30-day dispute window per CA DOI
// regulation 2695.7(b). However, the agent did not mention the right
// to request an independent appraisal, which is required for claims
// exceeding $5,000."
}

The reasoning field is critical for audits. It is not enough to know a criterion passed or failed. Regulators want to know why. The AI evaluator explains its scoring in plain language, citing specific moments in the conversation where the agent met or missed the requirement. For evidence on whether AI scoring matches human expert judgment: Are AI Models Better Call Scorers Than Humans?
This also creates a feedback loop for improvement. When the compliance criterion for "independent appraisal disclosure" starts failing at a higher rate, you know exactly which knowledge base section to update. You are not guessing which procedure is wrong. The scorecard data tells you.
For a deeper look at regulatory compliance in AI-handled conversations: Voice AI in Regulated Industries.
The architecture of a monitored call center
Here is the full picture: the system architecture for an enterprise call center running AI agents with comprehensive quality monitoring.
The key insight in this architecture is that quality monitoring is not a feature added on top. It is a parallel system that runs on every call. The AI agent handles the conversation. The scorecard system evaluates the conversation. The monitoring layer aggregates evaluations into trends, alerts, and reports.
This is different from traditional call center QA in a fundamental way. Traditional QA is sampling-based and retrospective. You listen to calls after the fact and hope your sample is representative. Automated scorecard evaluation is comprehensive and near real-time. Every call is graded. Regressions surface in hours, not weeks.
The cost of this quality layer is roughly $0.02 per call for the AI evaluator. At 4,000 AI calls per day, that is $80 per day, or about $2,400 per month. Compare that to the cost of a single compliance violation, a single viral complaint, or a single week of degraded quality affecting thousands of customers. The monitoring infrastructure pays for itself on the first regression it catches.
What to build first
If you are running AI agents in a call center today without comprehensive quality monitoring, here is the order of operations:
Week 1: Scorecards. Define scorecards for your top three call types by volume. Start with the criteria that map directly to customer satisfaction and regulatory requirements. Do not try to measure everything. Measure what matters most for each call type.
Week 2: Baseline. Run the scorecards against a sample of recent AI conversations to establish your quality baseline. This is your "before" measurement. You will compare all future changes against it.
Week 3: Scenarios. Build test personas for each call type. At minimum: a straightforward caller, an edge-case caller, and an emotional caller. Wire the scenarios into your deployment pipeline so they run automatically before any prompt or knowledge base change goes live.
Week 4: Monitoring. Set up the quality dashboard with per-scorecard trend lines and regression alerts. Configure alerts for any dimension dropping more than 5 points below baseline. Review the first weekly quality report with your team.
That is four weeks from "we check the ones customers complain about" to "every call is graded, every regression is caught, and we have an audit trail for regulators."
The AI agents are already deployed. The calls are already happening. The only question is whether you are grading them or hoping for the best.