
Your AI Assistant Works in Demo. Then What?

Test your AI shopping assistant with AI personas that simulate real customer segments, score conversations with objective scorecards, and monitor production metrics that matter for ecommerce.

Dean Grover, Co-founder
March 21, 2026

Your AI shopping assistant nails the demo. You ask for hiking boots, it returns three cards with images, prices, and an "Add to Cart" button. Everything looks polished. But what happens when a price-obsessed bargain hunter says "too expensive, show me something else"? Or a return customer says "something like the ones I got last year, but in blue"? You need to find out before they do.

Part 3 of 3. Part 1 covered widget rendering. Part 2 built the intelligence layer. Now: three steps from demo to production-tested.

Persona Testing: Simulating Real Shoppers

Create AI personas that shop like real customers, then let them loose on your agent. Unit tests check code paths. They can verify your product search API returns 200 OK with three hiking boots. They cannot tell you if those were the right hiking boots, or if the model returned running shoes because it matched on "shoes." That's the gap. AI personas close it by simulating actual customer behavior: asking questions, pushing back on suggestions, and following up with context that only a real shopper would bring.

You don't script the conversations. You define who the shopper is, give them a backstory and emotional state, then let the persona interact naturally. Think of it as improv theater where the AI plays the customer and your agent plays the salesperson. The persona has motivations, budget constraints, and emotional triggers. Your agent has to respond to whatever the persona throws at it.

Four personas cover most ecommerce edge cases:

| Persona | Key Traits | What It Tests |
| --- | --- | --- |
| Budget Shopper | Price-sensitive, compares aggressively, asks about sales | Price filtering, cheaper alternatives, budget awareness |
| Luxury Buyer | Brand-conscious, cares about materials, reads reviews | Quality signals, brand recommendations, premium positioning |
| Gift Shopper | Buying for someone else, uncertain, needs guidance | Clarifying questions, safe suggestions, gift wrapping |
| Return Customer | References past purchases, expects recognition | Memory retrieval, preference awareness, personalization |

Why these four? They stress-test different capabilities. The budget shopper tests price awareness and alternative suggestions. The luxury buyer tests whether the agent can surface quality signals like materials and craftsmanship instead of defaulting to "most popular." The gift shopper tests conversational guidance. And the return customer tests whether the memory pipeline from Part 2 actually delivers personalized recommendations or just generic ones.

A budget-conscious parent and a luxury interior designer push your agent in completely opposite directions. Here's how you define them:

```typescript
const personas = [
  {
    name: "Budget Shopper",
    emotion: "stressed",       // Emotion shapes how the persona phrases requests
    backstory: "Single parent shopping for school supplies. Every dollar matters. " +
      "Will ask about sales, compare prices, and push back if suggestions exceed budget.",
    intentClarity: "direct",   // Knows what they want — tests price filtering
    speechStyle: "casual",
  },
  {
    name: "Luxury Buyer",
    emotion: "calm",
    backstory: "Interior designer sourcing premium furniture for a client project. " +
      "Cares about materials, brand reputation, and craftsmanship. Price is secondary.",
    intentClarity: "detailed",  // Tests whether agent handles rich queries
    speechStyle: "formal",
  },
  {
    name: "Gift Shopper",
    emotion: "curious",
    backstory: "Buying a birthday gift for a teenage niece. Has no idea what teenagers " +
      "like right now. Needs the agent to guide the conversation.",
    intentClarity: "vague",     // Tests clarifying questions — does the agent ask or guess?
    speechStyle: "casual",
  },
  {
    name: "Return Customer",
    emotion: "friendly",
    backstory: "Has purchased running shoes twice before. Prefers lightweight trail runners " +
      "under $150. Expects the agent to remember past preferences.",
    intentClarity: "direct",
    speechStyle: "casual",      // Tests memory recall from Part 2
  },
];
```

Each persona becomes a scenario that the engine runs autonomously. No scripted messages, no predetermined paths:

```typescript
import Chanl from "@chanl/sdk";

const sdk = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Persona runs autonomously — no scripted messages
const { data } = await sdk.scenarios.run(budgetShopperScenarioId, {
  agentId: shoppingAssistantId,
  simulationMode: "text",  // "voice" for testing voice agents
});

console.log("Execution started:", data.executionId);

// In practice, poll until the execution has completed before reading the score
const result = await sdk.scenarios.getExecution(data.executionId);
console.log("Score:", result.data.execution.overallScore);  // 0-100
```

The budget shopper just asked for those hiking boots from Part 1, but under $50. Did your agent find cheaper alternatives in the catalog, or did it apologize and give up? The gift shopper said "I don't know, something a teenager would like?" and tested whether the agent asked clarifying questions (age range, interests, budget) or dumped a generic list of bestsellers. The return customer said "like the ones I got last year but in blue" and tested whether the memory from Part 2 actually recalled their purchase history.

Run all four personas before every release. Each one takes about 30 seconds. If the budget shopper passes but the return customer fails, you know your memory retrieval pipeline has a bug. If the gift shopper passes but the luxury buyer fails, your agent handles vague queries well but chokes on detailed material specifications.
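That persona-to-subsystem mapping is mechanical enough to automate. Here's a small triage sketch (the subsystem labels are our own shorthand for the capabilities in the table above, not SDK output):

```typescript
// Map each persona to the capability it stress-tests (from the table above).
const suspectSubsystem: Record<string, string> = {
  "Budget Shopper": "price filtering and cheaper-alternative search",
  "Luxury Buyer": "quality-signal retrieval (materials, brand, reviews)",
  "Gift Shopper": "clarifying-question prompts",
  "Return Customer": "memory retrieval pipeline",
};

// Given per-persona scores (0-100), list the subsystems to investigate, worst first.
function triage(scores: Record<string, number>, passingThreshold = 70): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < passingThreshold)
    .sort(([, a], [, b]) => a - b)
    .map(([persona]) => `${persona} failed -> check ${suspectSubsystem[persona]}`);
}

// A memory bug shows up as a Return Customer failure.
console.log(triage({
  "Budget Shopper": 88,
  "Luxury Buyer": 81,
  "Gift Shopper": 74,
  "Return Customer": 52,
}));
```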

But you still don't know why a conversation scored 60 instead of 90. That's what scorecards measure.

Scorecards: Replacing Vibes with Metrics

Five dimensions, scored by AI evaluators, tracked over time. "Felt good" is not a metric. After ten persona runs, your team will have ten opinions about whether the agent "did well." Those opinions will contradict each other. Scorecards replace subjective impressions with structured, repeatable evaluation so you can point to a number instead of a feeling.

Here are the five dimensions that matter for ecommerce:

| Dimension | Type | What It Measures |
| --- | --- | --- |
| Product Relevance | Score (0-10) | Did recommendations match the customer's stated needs? |
| Price Compliance | Pass/Fail | Did the agent stay within the customer's budget constraints? |
| Recovery Quality | Score (0-10) | After rejection, did follow-up suggestions feel meaningfully different? |
| Memory Usage | Pass/Fail | Did the agent reference stored preferences when available? |
| Conversion Signals | Score (0-10) | Did the agent include clear next steps (add to cart, view details, compare)? |

Define the scorecard:

```typescript
const { data: scorecard } = await sdk.scorecards.create({
  name: "Shopping Assistant Quality",
  description: "Evaluates product recommendations, price awareness, and recovery",
  status: "active",
  passingThreshold: 70,              // Below 70 = failed QA
  scoringAlgorithm: "weighted_average",  // Relevance weighted higher than recovery
  industry: "ecommerce",
  useCase: "support",
  // Criteria (relevance, price compliance, recovery) configured
  // in the dashboard or via follow-up API calls
});
```
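For intuition, here's roughly what a weighted-average algorithm computes (a sketch with hypothetical criteria keys and weights; the real weights live in your scorecard configuration):

```typescript
interface CriterionResult {
  key: string;
  score: number;   // normalized 0-100 (pass/fail criteria map to 0 or 100)
  weight: number;  // relative importance
}

// Weighted average: each criterion contributes score * weight / totalWeight.
function weightedAverage(criteria: CriterionResult[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => sum + c.score * c.weight, 0);
  return Math.round(weighted / totalWeight);
}

// Hypothetical weights: relevance counts double recovery.
const overall = weightedAverage([
  { key: "product_relevance", score: 80, weight: 2 },
  { key: "price_compliance", score: 100, weight: 1 }, // pass -> 100
  { key: "recovery_quality", score: 40, weight: 1 },
]);
console.log(overall >= 70 ? "PASS" : "FAIL", overall); // prints "PASS 75"
```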

After a persona scenario runs, evaluate the conversation:

```typescript
// Run AI evaluator against the full conversation transcript
const { data: evaluation } = await sdk.scorecards.evaluate(callId, {
  scorecardId: scorecard.id,
});

const { data: results } = await sdk.scorecards.getResultsByCall(callId);
const result = results.results[0];
console.log("Overall score:", result.overallScore);
for (const cr of result.criteriaResults) {
  console.log(`  ${cr.criteriaKey}: ${cr.result} (${cr.passed ? "PASS" : "FAIL"})`);
  console.log(`  Reasoning: ${cr.reasoning}`);  // AI explains WHY it scored this way
}
```

You don't get a single thumbs-up. You get a breakdown: product relevance was 8/10, but recovery quality scored 4/10 because the agent kept suggesting the same category after the customer asked for something different. That Trailblazer Pro from Part 1? If the scorecard catches the agent recommending it after the customer said "no hiking boots," you know exactly which prompt to fix.

This is what makes scorecard evaluation different from traditional A/B testing. A/B tests tell you which version converts better. Scorecards tell you why. Was it the product relevance? The price compliance? The recovery after rejection? Each dimension isolates a specific capability so you can improve it without guessing.

Wire scorecard evaluation into your CI pipeline. Run four personas, evaluate all four conversations, and fail the build if any score drops below your passing threshold. Now you have a regression test for conversation quality, not just code correctness.
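The gate itself can be a pure function (a sketch; in a real pipeline you would feed it scores collected from the SDK and call process.exit(1) on failure):

```typescript
// Fail the build if any persona conversation scored below the passing threshold.
function ciGate(
  results: { persona: string; score: number }[],
  threshold = 70,
): { passed: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.score < threshold)
    .map((r) => `${r.persona}: ${r.score} < ${threshold}`);
  return { passed: failures.length === 0, failures };
}

const gate = ciGate([
  { persona: "Budget Shopper", score: 82 },
  { persona: "Luxury Buyer", score: 77 },
  { persona: "Gift Shopper", score: 91 },
  { persona: "Return Customer", score: 64 }, // memory regression slips below 70
]);
if (!gate.passed) {
  console.error("Conversation QA failed:", gate.failures.join("; "));
  // In CI: process.exit(1) here to fail the build
}
```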

More on scorecard design: Scorecards vs. Vibes: How to Actually Measure AI Agent Quality.

Illustration: a quality analyst reviewing per-dimension scores (Tone & Empathy, Resolution, Response Time, Compliance)

Production Monitoring: Five Metrics That Predict Customer Satisfaction

Testing catches problems before launch. Monitoring catches them after. No persona set covers every real-world conversation, and customer behavior at scale always surprises you. Someone will ask for a product category you never considered. Someone will phrase a request in a way that bypasses your search entirely. Standard infrastructure metrics (uptime, latency, error rates) are table stakes. These five behavioral metrics predict customer experience problems before they show up in reviews:

| Metric | Healthy Range | Alert Threshold | What It Means |
| --- | --- | --- | --- |
| Recommendation acceptance | >30% | <20% | Customers aren't clicking. Your relevance model needs work. |
| Conversation depth | 3-7 turns | >10 avg turns | Agent is looping or missing intent. |
| Recovery rate | >40% | <25% | Agent can't adapt after rejection. Fix your prompts. |
| Memory hit rate | >60% | <40% | Memory integration is broken. Debug the retrieval pipeline. |
| Fallback rate | <10% | >15% | Catalog has gaps. Investigate by category. |
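These thresholds translate directly into an automated check. A sketch (the metric names and daily-snapshot shape are assumptions; rates are expressed as fractions):

```typescript
interface DailyMetrics {
  recommendationAcceptance: number; // fraction of suggestions clicked
  avgConversationTurns: number;
  recoveryRate: number;
  memoryHitRate: number;
  fallbackRate: number;
}

// Return an alert message for any metric past its threshold (from the table above).
function checkThresholds(m: DailyMetrics): string[] {
  const alerts: string[] = [];
  if (m.recommendationAcceptance < 0.20) alerts.push("acceptance < 20%: relevance model needs work");
  if (m.avgConversationTurns > 10) alerts.push("avg turns > 10: agent looping or missing intent");
  if (m.recoveryRate < 0.25) alerts.push("recovery < 25%: agent not adapting after rejection");
  if (m.memoryHitRate < 0.40) alerts.push("memory hits < 40%: debug the retrieval pipeline");
  if (m.fallbackRate > 0.15) alerts.push("fallbacks > 15%: investigate catalog gaps by category");
  return alerts;
}
```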

Recommendation acceptance is the most direct signal. If customers consistently ignore what the agent suggests, the relevance model is not working regardless of what your persona test scores say. Check whether your product data (images, descriptions, pricing) is complete. A product card without an image gets skipped.

Conversation depth acts as a canary. Extremely short conversations (1-2 turns) mean the agent is not engaging. Extremely long ones (10+ turns) mean it is not converging on useful suggestions. The sweet spot is 3-7 turns: enough to understand intent, not so many that the customer gives up.

Recovery rate measures resilience. When a customer says "no, not that," does the next suggestion lead somewhere productive? Low recovery rates mean the agent does not adapt to feedback. This is almost always a prompt problem.

Memory hit rate matters most for return customers. If your knowledge base has customer data but the agent is not referencing it, the memory integration needs debugging. This is the metric that separates a generic chatbot from a personal shopping assistant.

Fallback rate exposes catalog gaps. Every "I couldn't find anything matching your request" is a missed sale. Track fallbacks by product category to find exactly where the holes are.

Track these daily with analytics dashboards. A single bad day is noise. A week-long trend is a signal.
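One simple way to encode "a single bad day is noise, a week-long trend is a signal" is to alert only after several consecutive days past the threshold (a sketch; the 5-day window is an arbitrary choice):

```typescript
// Alert only on a sustained breach, not a single bad day.
function sustainedBreach(
  dailyValues: number[],            // oldest to newest
  breached: (v: number) => boolean, // threshold test for this metric
  consecutiveDays = 5,
): boolean {
  if (dailyValues.length < consecutiveDays) return false;
  return dailyValues.slice(-consecutiveDays).every(breached);
}

// Fallback rate creeping up over a week: the last five days are all past 15%.
const week = [0.09, 0.11, 0.16, 0.17, 0.18, 0.19, 0.21];
console.log(sustainedBreach(week, (v) => v > 0.15)); // prints true
```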

More on monitoring strategy: Real-Time Monitoring for AI Agents.

The Complete Stack

Here's the full architecture across all three parts:

Part 1 (Widget Layer): Chat Interface, Vercel AI SDK, Zod Tool Schemas, Product Card Renderer, Comparison Widget, Cart Actions
Part 2 (Intelligence Layer): Commerce MCP Server (Product Search, Inventory Check, and Price Lookup tools), Knowledge Base (Product Catalog RAG), Customer Memory (Preference Store, Purchase History)
Part 3 (Testing + Production): AI Personas (Budget Shopper, Luxury Buyer, Gift Shopper, Return Customer), Scorecards (Relevance Score, Price Compliance, Recovery Quality), Production Monitoring (Acceptance Rate, Fallback Rate, Memory Hit Rate)
Complete AI Shopping Assistant Stack

Three articles ago, the AI returned text. Now it renders product cards from real catalogs, remembers the customer across visits, and gets tested by AI shoppers before a single real customer sees it.

The widget layer from Part 1 makes recommendations visible. The intelligence layer from Part 2 makes them relevant. The testing and monitoring layer from this article makes them reliable. Each layer reinforces the others. Beautiful product cards do not matter if the recommendations are poor. Smart recommendations do not matter if you cannot verify them across customer segments. And testing before launch does not matter if you are not watching what happens after.

Build smarter shopping agents

Chanl gives AI shopping assistants the backend they need: product knowledge, customer memory, MCP tools, and testing. The rendering is yours. The intelligence is handled.

Start building
