
Your AI Assistant Works in Demo. Then What?

Test your AI shopping assistant with AI personas that simulate real customer segments, score conversations with objective scorecards, and monitor production metrics that matter for ecommerce.

Dean Grover, Co-founder
March 21, 2026

Your AI shopping assistant nails the demo. You ask for hiking boots, it returns three cards with images, prices, and an "Add to Cart" button. Everything looks polished. But what happens when a price-obsessed bargain hunter says "too expensive, show me something else"? Or a return customer says "something like the ones I got last year, but in blue"? You need to find out before they do.

Part 3 of 3. Part 1 covered widget rendering. Part 2 built the intelligence layer. Now: three steps from demo to production-tested.

Persona Testing: Simulating Real Shoppers

Create AI personas that shop like real customers, then let them loose on your agent. Unit tests check code paths. They can verify your product search API returns 200 OK with three hiking boots. They cannot tell you if those were the right hiking boots, or if the model returned running shoes because it matched on "shoes." That's the gap. AI personas close it by simulating actual customer behavior: asking questions, pushing back on suggestions, and following up with context that only a real shopper would bring.

You don't script the conversations. You define who the shopper is, give them a backstory and emotional state, then let the persona interact naturally. Think of it as improv theater where the AI plays the customer and your agent plays the salesperson. The persona has motivations, budget constraints, and emotional triggers. Your agent has to respond to whatever the persona throws at it.

Four personas cover most ecommerce edge cases:

| Persona | Key Traits | What It Tests |
| --- | --- | --- |
| Budget Shopper | Price-sensitive, compares aggressively, asks about sales | Price filtering, cheaper alternatives, budget awareness |
| Luxury Buyer | Brand-conscious, cares about materials, reads reviews | Quality signals, brand recommendations, premium positioning |
| Gift Shopper | Buying for someone else, uncertain, needs guidance | Clarifying questions, safe suggestions, gift wrapping |
| Return Customer | References past purchases, expects recognition | Memory retrieval, preference awareness, personalization |

Why these four? They stress-test different capabilities. The budget shopper tests price awareness and alternative suggestions. The luxury buyer tests whether the agent can surface quality signals like materials and craftsmanship instead of defaulting to "most popular." The gift shopper tests conversational guidance. And the return customer tests whether the memory pipeline from Part 2 actually delivers personalized recommendations or just generic ones.

A budget-conscious parent and a luxury interior designer push your agent in completely opposite directions. Here's how you define them:

```typescript
const personas = [
  {
    name: "Budget Shopper",
    emotion: "stressed",       // Emotion shapes how the persona phrases requests
    backstory: "Single parent shopping for school supplies. Every dollar matters. " +
      "Will ask about sales, compare prices, and push back if suggestions exceed budget.",
    intentClarity: "direct",   // Knows what they want — tests price filtering
    speechStyle: "casual",
  },
  {
    name: "Luxury Buyer",
    emotion: "calm",
    backstory: "Interior designer sourcing premium furniture for a client project. " +
      "Cares about materials, brand reputation, and craftsmanship. Price is secondary.",
    intentClarity: "detailed",  // Tests whether agent handles rich queries
    speechStyle: "formal",
  },
  {
    name: "Gift Shopper",
    emotion: "curious",
    backstory: "Buying a birthday gift for a teenage niece. Has no idea what teenagers " +
      "like right now. Needs the agent to guide the conversation.",
    intentClarity: "vague",     // Tests clarifying questions — does the agent ask or guess?
    speechStyle: "casual",
  },
  {
    name: "Return Customer",
    emotion: "friendly",
    backstory: "Has purchased running shoes twice before. Prefers lightweight trail runners " +
      "under $150. Expects the agent to remember past preferences.",
    intentClarity: "direct",
    speechStyle: "casual",      // Tests memory recall from Part 2
  },
];
```

Each persona becomes a scenario that the engine runs autonomously. No scripted messages, no predetermined paths:

```typescript
import Chanl from "@chanl/sdk";

const sdk = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Persona runs autonomously — no scripted messages
const { data } = await sdk.scenarios.run(budgetShopperScenarioId, {
  agentId: shoppingAssistantId,
  simulationMode: "text",  // "voice" for testing voice agents
});

console.log("Execution started:", data.executionId);

// In practice, poll until the execution has completed before reading the score
const result = await sdk.scenarios.getExecution(data.executionId);
console.log("Score:", result.data.execution.overallScore);  // 0-100
```

The budget shopper just asked for those hiking boots from Part 1, but under $50. Did your agent find cheaper alternatives in the catalog, or did it apologize and give up? The gift shopper said "I don't know, something a teenager would like?" and tested whether the agent asked clarifying questions (age range, interests, budget) or dumped a generic list of bestsellers. The return customer said "like the ones I got last year but in blue" and tested whether the memory from Part 2 actually recalled their purchase history.

Run all four personas before every release. Each one takes about 30 seconds. If the budget shopper passes but the return customer fails, you know your memory retrieval pipeline has a bug. If the gift shopper passes but the luxury buyer fails, your agent handles vague queries well but chokes on detailed material specifications.
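That persona-to-subsystem mapping is mechanical enough to automate. Here's a small triage sketch (the subsystem labels are our own shorthand for the capabilities in the table above, not SDK output):

```typescript
// Map each persona to the capability it stress-tests (from the table above).
const suspectSubsystem: Record<string, string> = {
  "Budget Shopper": "price filtering and cheaper-alternative search",
  "Luxury Buyer": "quality-signal retrieval (materials, brand, reviews)",
  "Gift Shopper": "clarifying-question prompts",
  "Return Customer": "memory retrieval pipeline",
};

// Given per-persona scores (0-100), list the subsystems to investigate, worst first.
function triage(scores: Record<string, number>, passingThreshold = 70): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < passingThreshold)
    .sort(([, a], [, b]) => a - b)
    .map(([persona]) => `${persona} failed -> check ${suspectSubsystem[persona]}`);
}

// A memory bug shows up as a Return Customer failure.
console.log(triage({
  "Budget Shopper": 88,
  "Luxury Buyer": 81,
  "Gift Shopper": 74,
  "Return Customer": 52,
}));
```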

But you still don't know why a conversation scored 60 instead of 90. That's what scorecards measure.

Scorecards: Replacing Vibes with Metrics

Five dimensions, scored by AI evaluators, tracked over time. "Felt good" is not a metric. After ten persona runs, your team will have ten opinions about whether the agent "did well." Those opinions will contradict each other. Scorecards replace subjective impressions with structured, repeatable evaluation so you can point to a number instead of a feeling.

Here are the five dimensions that matter for ecommerce:

| Dimension | Type | What It Measures |
| --- | --- | --- |
| Product Relevance | Score (0-10) | Did recommendations match the customer's stated needs? |
| Price Compliance | Pass/Fail | Did the agent stay within the customer's budget constraints? |
| Recovery Quality | Score (0-10) | After rejection, did follow-up suggestions feel meaningfully different? |
| Memory Usage | Pass/Fail | Did the agent reference stored preferences when available? |
| Conversion Signals | Score (0-10) | Did the agent include clear next steps (add to cart, view details, compare)? |

Define the scorecard:

```typescript
const { data: scorecard } = await sdk.scorecards.create({
  name: "Shopping Assistant Quality",
  description: "Evaluates product recommendations, price awareness, and recovery",
  status: "active",
  passingThreshold: 70,              // Below 70 = failed QA
  scoringAlgorithm: "weighted_average",  // Relevance weighted higher than recovery
  industry: "ecommerce",
  useCase: "support",
  // Criteria (relevance, price compliance, recovery) configured
  // in the dashboard or via follow-up API calls
});
```
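For intuition, here's roughly what a weighted-average algorithm computes (a sketch with hypothetical criteria keys and weights; the real weights live in your scorecard configuration):

```typescript
interface CriterionResult {
  key: string;
  score: number;   // normalized 0-100 (pass/fail criteria map to 0 or 100)
  weight: number;  // relative importance
}

// Weighted average: each criterion contributes score * weight / totalWeight.
function weightedAverage(criteria: CriterionResult[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => sum + c.score * c.weight, 0);
  return Math.round(weighted / totalWeight);
}

// Hypothetical weights: relevance counts double recovery.
const overall = weightedAverage([
  { key: "product_relevance", score: 80, weight: 2 },
  { key: "price_compliance", score: 100, weight: 1 }, // pass -> 100
  { key: "recovery_quality", score: 40, weight: 1 },
]);
console.log(overall >= 70 ? "PASS" : "FAIL", overall); // prints "PASS 75"
```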

After a persona scenario runs, evaluate the conversation:

```typescript
// Run AI evaluator against the full conversation transcript
const { data: evaluation } = await sdk.scorecards.evaluate(callId, {
  scorecardId: scorecard.id,
});

const { data: results } = await sdk.scorecards.getResultsByCall(callId);
const result = results.results[0];
console.log("Overall score:", result.overallScore);
for (const cr of result.criteriaResults) {
  console.log(`  ${cr.criteriaKey}: ${cr.result} (${cr.passed ? "PASS" : "FAIL"})`);
  console.log(`  Reasoning: ${cr.reasoning}`);  // AI explains WHY it scored this way
}
```

You don't get a single thumbs-up. You get a breakdown: product relevance was 8/10, but recovery quality scored 4/10 because the agent kept suggesting the same category after the customer asked for something different. That Trailblazer Pro from Part 1? If the scorecard catches the agent recommending it after the customer said "no hiking boots," you know exactly which prompt to fix.

This is what makes scorecard evaluation different from traditional A/B testing. A/B tests tell you which version converts better. Scorecards tell you why. Was it the product relevance? The price compliance? The recovery after rejection? Each dimension isolates a specific capability so you can improve it without guessing.

Wire scorecard evaluation into your CI pipeline. Run four personas, evaluate all four conversations, and fail the build if any score drops below your passing threshold. Now you have a regression test for conversation quality, not just code correctness.
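The gate itself can be a pure function (a sketch; in a real pipeline you would feed it scores collected from the SDK and call process.exit(1) on failure):

```typescript
// Fail the build if any persona conversation scored below the passing threshold.
function ciGate(
  results: { persona: string; score: number }[],
  threshold = 70,
): { passed: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.score < threshold)
    .map((r) => `${r.persona}: ${r.score} < ${threshold}`);
  return { passed: failures.length === 0, failures };
}

const gate = ciGate([
  { persona: "Budget Shopper", score: 82 },
  { persona: "Luxury Buyer", score: 77 },
  { persona: "Gift Shopper", score: 91 },
  { persona: "Return Customer", score: 64 }, // memory regression slips below 70
]);
if (!gate.passed) {
  console.error("Conversation QA failed:", gate.failures.join("; "));
  // In CI: process.exit(1) here to fail the build
}
```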

More on scorecard design: Scorecards vs. Vibes: How to Actually Measure AI Agent Quality.

Illustration: a quality analyst reviewing per-dimension scores (Tone & Empathy, Resolution, Response Time, Compliance)

Production Monitoring: Five Metrics That Predict Customer Satisfaction

Testing catches problems before launch. Monitoring catches them after. No persona set covers every real-world conversation, and customer behavior at scale always surprises you. Someone will ask for a product category you never considered. Someone will phrase a request in a way that bypasses your search entirely. Standard infrastructure metrics (uptime, latency, error rates) are table stakes. These five behavioral metrics predict customer experience problems before they show up in reviews:

| Metric | Healthy Range | Alert Threshold | What It Means |
| --- | --- | --- | --- |
| Recommendation acceptance | >30% | <20% | Customers aren't clicking. Your relevance model needs work. |
| Conversation depth | 3-7 turns | >10 avg turns | Agent is looping or missing intent. |
| Recovery rate | >40% | <25% | Agent can't adapt after rejection. Fix your prompts. |
| Memory hit rate | >60% | <40% | Memory integration is broken. Debug the retrieval pipeline. |
| Fallback rate | <10% | >15% | Catalog has gaps. Investigate by category. |
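These thresholds translate directly into an automated check. A sketch (the metric names and daily-snapshot shape are assumptions; rates are expressed as fractions):

```typescript
interface DailyMetrics {
  recommendationAcceptance: number; // fraction of suggestions clicked
  avgConversationTurns: number;
  recoveryRate: number;
  memoryHitRate: number;
  fallbackRate: number;
}

// Return an alert message for any metric past its threshold (from the table above).
function checkThresholds(m: DailyMetrics): string[] {
  const alerts: string[] = [];
  if (m.recommendationAcceptance < 0.20) alerts.push("acceptance < 20%: relevance model needs work");
  if (m.avgConversationTurns > 10) alerts.push("avg turns > 10: agent looping or missing intent");
  if (m.recoveryRate < 0.25) alerts.push("recovery < 25%: agent not adapting after rejection");
  if (m.memoryHitRate < 0.40) alerts.push("memory hits < 40%: debug the retrieval pipeline");
  if (m.fallbackRate > 0.15) alerts.push("fallbacks > 15%: investigate catalog gaps by category");
  return alerts;
}
```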

Recommendation acceptance is the most direct signal. If customers consistently ignore what the agent suggests, the relevance model is not working regardless of what your persona test scores say. Check whether your product data (images, descriptions, pricing) is complete. A product card without an image gets skipped.

Conversation depth acts as a canary. Extremely short conversations (1-2 turns) mean the agent is not engaging. Extremely long ones (10+ turns) mean it is not converging on useful suggestions. The sweet spot is 3-7 turns: enough to understand intent, not so many that the customer gives up.

Recovery rate measures resilience. When a customer says "no, not that," does the next suggestion lead somewhere productive? Low recovery rates mean the agent does not adapt to feedback. This is almost always a prompt problem.

Memory hit rate matters most for return customers. If your knowledge base has customer data but the agent is not referencing it, the memory integration needs debugging. This is the metric that separates a generic chatbot from a personal shopping assistant.

Fallback rate exposes catalog gaps. Every "I couldn't find anything matching your request" is a missed sale. Track fallbacks by product category to find exactly where the holes are.

Track these daily with analytics dashboards. A single bad day is noise. A week-long trend is a signal.
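One simple way to encode "a single bad day is noise, a week-long trend is a signal" is to alert only after several consecutive days past the threshold (a sketch; the 5-day window is an arbitrary choice):

```typescript
// Alert only on a sustained breach, not a single bad day.
function sustainedBreach(
  dailyValues: number[],            // oldest to newest
  breached: (v: number) => boolean, // threshold test for this metric
  consecutiveDays = 5,
): boolean {
  if (dailyValues.length < consecutiveDays) return false;
  return dailyValues.slice(-consecutiveDays).every(breached);
}

// Fallback rate creeping up over a week: the last five days are all past 15%.
const week = [0.09, 0.11, 0.16, 0.17, 0.18, 0.19, 0.21];
console.log(sustainedBreach(week, (v) => v > 0.15)); // prints true
```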

More on monitoring strategy: Real-Time Monitoring for AI Agents.

The Complete Stack

Here's the full architecture across all three parts:

Part 1 (Widget Layer): Chat Interface, Vercel AI SDK, Zod Tool Schemas, Product Card Renderer, Comparison Widget, Cart Actions
Part 2 (Intelligence Layer): Commerce MCP Server (Product Search, Inventory Check, and Price Lookup tools), Knowledge Base (Product Catalog RAG), Customer Memory (Preference Store, Purchase History)
Part 3 (Testing + Production): AI Personas (Budget Shopper, Luxury Buyer, Gift Shopper, Return Customer), Scorecards (Relevance Score, Price Compliance, Recovery Quality), Production Monitoring (Acceptance Rate, Fallback Rate, Memory Hit Rate)
Complete AI Shopping Assistant Stack

Three articles ago, the AI returned text. Now it renders product cards from real catalogs, remembers the customer across visits, and gets tested by AI shoppers before a single real customer sees it.

The widget layer from Part 1 makes recommendations visible. The intelligence layer from Part 2 makes them relevant. The testing and monitoring layer from this article makes them reliable. Each layer reinforces the others. Beautiful product cards do not matter if the recommendations are poor. Smart recommendations do not matter if you cannot verify them across customer segments. And testing before launch does not matter if you are not watching what happens after.

Build smarter shopping agents

Chanl gives AI shopping assistants the backend they need: product knowledge, customer memory, MCP tools, and testing. The rendering is yours. The intelligence is handled.

Start building
