Your AI shopping assistant nails the demo. You ask for hiking boots, it returns three cards with images, prices, and an "Add to Cart" button. Everything looks polished. But what happens when a price-obsessed bargain hunter says "too expensive, show me something else"? Or a return customer says "something like the ones I got last year, but in blue"? You need to find out before they do.
Part 3 of 3. Part 1 covered widget rendering; Part 2 built the intelligence layer. This article covers the three steps from demo to production: persona testing, scorecards, and monitoring.
## Persona Testing: Simulating Real Shoppers
Create AI personas that shop like real customers, then let them loose on your agent. Unit tests check code paths. They can verify your product search API returns 200 OK with three hiking boots. They cannot tell you if those were the right hiking boots, or if the model returned running shoes because it matched on "shoes." That's the gap. AI personas close it by simulating actual customer behavior: asking questions, pushing back on suggestions, and following up with context that only a real shopper would bring.
You don't script the conversations. You define who the shopper is, give them a backstory and emotional state, then let the persona interact naturally. Think of it as improv theater where the AI plays the customer and your agent plays the salesperson. The persona has motivations, budget constraints, and emotional triggers. Your agent has to respond to whatever the persona throws at it.
Four personas cover most ecommerce edge cases:
| Persona | Key Traits | What It Tests |
|---|---|---|
| Budget Shopper | Price-sensitive, compares aggressively, asks about sales | Price filtering, cheaper alternatives, budget awareness |
| Luxury Buyer | Brand-conscious, cares about materials, reads reviews | Quality signals, brand recommendations, premium positioning |
| Gift Shopper | Buying for someone else, uncertain, needs guidance | Clarifying questions, safe suggestions, gift wrapping |
| Return Customer | References past purchases, expects recognition | Memory retrieval, preference awareness, personalization |
Why these four? They stress-test different capabilities. The budget shopper tests price awareness and alternative suggestions. The luxury buyer tests whether the agent can surface quality signals like materials and craftsmanship instead of defaulting to "most popular." The gift shopper tests conversational guidance. And the return customer tests whether the memory pipeline from Part 2 actually delivers personalized recommendations or just generic ones.
A budget-conscious parent and a luxury interior designer push your agent in completely opposite directions. Here's how you define them:
```typescript
const personas = [
  {
    name: "Budget Shopper",
    emotion: "stressed", // Emotion shapes how the persona phrases requests
    backstory: "Single parent shopping for school supplies. Every dollar matters. " +
      "Will ask about sales, compare prices, and push back if suggestions exceed budget.",
    intentClarity: "direct", // Knows what they want — tests price filtering
    speechStyle: "casual",
  },
  {
    name: "Luxury Buyer",
    emotion: "calm",
    backstory: "Interior designer sourcing premium furniture for a client project. " +
      "Cares about materials, brand reputation, and craftsmanship. Price is secondary.",
    intentClarity: "detailed", // Tests whether agent handles rich queries
    speechStyle: "formal",
  },
  {
    name: "Gift Shopper",
    emotion: "curious",
    backstory: "Buying a birthday gift for a teenage niece. Has no idea what teenagers " +
      "like right now. Needs the agent to guide the conversation.",
    intentClarity: "vague", // Tests clarifying questions — does the agent ask or guess?
    speechStyle: "casual",
  },
  {
    name: "Return Customer",
    emotion: "friendly",
    backstory: "Has purchased running shoes twice before. Prefers lightweight trail runners " +
      "under $150. Expects the agent to remember past preferences.",
    intentClarity: "direct",
    speechStyle: "casual", // Tests memory recall from Part 2
  },
];
```

Each persona becomes a scenario that the engine runs autonomously. No scripted messages, no predetermined paths:
```typescript
import Chanl from "@chanl/sdk";

const sdk = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Persona runs autonomously — no scripted messages
const { data } = await sdk.scenarios.run(budgetShopperScenarioId, {
  agentId: shoppingAssistantId,
  simulationMode: "text", // "voice" for testing voice agents
});
console.log("Execution started:", data.executionId);

const result = await sdk.scenarios.getExecution(data.executionId);
console.log("Score:", result.data.execution.overallScore); // 0-100
```

The budget shopper just asked for those hiking boots from Part 1, but under $50. Did your agent find cheaper alternatives in the catalog, or did it apologize and give up? The gift shopper said "I don't know, something a teenager would like?" and tested whether the agent asked clarifying questions (age range, interests, budget) or dumped a generic list of bestsellers. The return customer said "like the ones I got last year but in blue" and tested whether the memory from Part 2 actually recalled their purchase history.
Run all four personas before every release. Each one takes about 30 seconds. If the budget shopper passes but the return customer fails, you know your memory retrieval pipeline has a bug. If the gift shopper passes but the luxury buyer fails, your agent handles vague queries well but chokes on detailed material specifications.
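If you encode which capability each persona stresses, a failing run points straight at the subsystem to debug. A minimal sketch of that diagnosis step (the mapping table and `diagnose` helper are illustrative, not part of the Chanl SDK):

```typescript
// Map each persona to the capability it stress-tests (labels are illustrative).
const capabilityUnderTest: Record<string, string> = {
  "Budget Shopper": "price filtering / cheaper alternatives",
  "Luxury Buyer": "quality signals / detailed queries",
  "Gift Shopper": "clarifying questions for vague intent",
  "Return Customer": "memory retrieval pipeline (Part 2)",
};

// Given persona scores from a release run, report which subsystems to debug.
function diagnose(scores: Record<string, number>, passingThreshold = 70): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < passingThreshold)
    .map(([persona]) => `${persona} failed -> check ${capabilityUnderTest[persona]}`);
}

// Flags only the failing persona, with the subsystem to investigate.
const report = diagnose({ "Budget Shopper": 85, "Return Customer": 55 });
```

The payoff is that a red build tells you where to look, not just that something broke.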
But you still don't know why a conversation scored 60 instead of 90. That's what scorecards measure.
## Scorecards: Replacing Vibes with Metrics
Five dimensions, scored by AI evaluators, tracked over time. "Felt good" is not a metric. After ten persona runs, your team will have ten opinions about whether the agent "did well." Those opinions will contradict each other. Scorecards replace subjective impressions with structured, repeatable evaluation so you can point to a number instead of a feeling.
Here are the five dimensions that matter for ecommerce:
| Dimension | Type | What It Measures |
|---|---|---|
| Product Relevance | Score (0-10) | Did recommendations match the customer's stated needs? |
| Price Compliance | Pass/Fail | Did the agent stay within the customer's budget constraints? |
| Recovery Quality | Score (0-10) | After rejection, did follow-up suggestions feel meaningfully different? |
| Memory Usage | Pass/Fail | Did the agent reference stored preferences when available? |
| Conversion Signals | Score (0-10) | Did the agent include clear next steps (add to cart, view details, compare)? |
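The `weighted_average` algorithm named in the scorecard config can be pictured as: normalize every criterion to a common 0-100 scale, then average by weight. A sketch under assumed weights and normalization (the platform computes this server-side; none of these type or field names come from the SDK):

```typescript
type Criterion =
  | { key: string; type: "score"; value: number; weight: number } // 0-10 scale
  | { key: string; type: "pass_fail"; passed: boolean; weight: number };

// Normalize each criterion to 0-100, then take the weighted average.
function overallScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => {
    const normalized = c.type === "score" ? c.value * 10 : c.passed ? 100 : 0;
    return sum + normalized * c.weight;
  }, 0);
  return Math.round(weighted / totalWeight);
}

// Relevance carries the highest weight, matching "weighted_average" intent.
const score = overallScore([
  { key: "product_relevance", type: "score", value: 8, weight: 3 },
  { key: "price_compliance", type: "pass_fail", passed: true, weight: 2 },
  { key: "recovery_quality", type: "score", value: 4, weight: 2 },
]); // -> 74
```

Mixing pass/fail and scored criteria this way is the reason one weak dimension (recovery at 4/10) can drag an otherwise-solid conversation below a 70 threshold.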
Define the scorecard:
```typescript
const { data: scorecard } = await sdk.scorecards.create({
  name: "Shopping Assistant Quality",
  description: "Evaluates product recommendations, price awareness, and recovery",
  status: "active",
  passingThreshold: 70, // Below 70 = failed QA
  scoringAlgorithm: "weighted_average", // Relevance weighted higher than recovery
  industry: "ecommerce",
  useCase: "support",
  // Criteria (relevance, price compliance, recovery) configured
  // in the dashboard or via follow-up API calls
});
```

After a persona scenario runs, evaluate the conversation:
```typescript
// Run AI evaluator against the full conversation transcript
const { data: evaluation } = await sdk.scorecards.evaluate(callId, {
  scorecardId: scorecard.id,
});

const { data: results } = await sdk.scorecards.getResultsByCall(callId);
const result = results.results[0];

console.log("Overall score:", result.overallScore);
for (const cr of result.criteriaResults) {
  console.log(`  ${cr.criteriaKey}: ${cr.result} (${cr.passed ? "PASS" : "FAIL"})`);
  console.log(`  Reasoning: ${cr.reasoning}`); // AI explains WHY it scored this way
}
```

You don't get a single thumbs-up. You get a breakdown: product relevance was 8/10, but recovery quality scored 4/10 because the agent kept suggesting the same category after the customer asked for something different. That Trailblazer Pro from Part 1? If the scorecard catches the agent recommending it after the customer said "no hiking boots," you know exactly which prompt to fix.
This is what makes scorecard evaluation different from traditional A/B testing. A/B tests tell you which version converts better. Scorecards tell you why. Was it the product relevance? The price compliance? The recovery after rejection? Each dimension isolates a specific capability so you can improve it without guessing.
Wire scorecard evaluation into your CI pipeline. Run four personas, evaluate all four conversations, and fail the build if any score drops below your passing threshold. Now you have a regression test for conversation quality, not just code correctness.
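The gate itself reduces to a pure check over the four scores. A sketch of the pass/fail decision (the `gateBuild` helper and result shape are assumptions, not SDK types):

```typescript
// Fail the build if any persona conversation scores below the passing threshold.
function gateBuild(
  results: { persona: string; overallScore: number }[],
  passingThreshold = 70,
): { pass: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.overallScore < passingThreshold)
    .map((r) => `${r.persona}: ${r.overallScore}`);
  return { pass: failures.length === 0, failures };
}

const verdict = gateBuild([
  { persona: "Budget Shopper", overallScore: 88 },
  { persona: "Return Customer", overallScore: 62 },
]);
// verdict.pass is false here; in CI you would print verdict.failures
// and set process.exitCode = 1 so the pipeline stops the release.
```

Because the check is deterministic and the scores come from the same evaluator every run, a drop below threshold is a real regression, not reviewer mood.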
More on scorecard design: Scorecards vs. Vibes: How to Actually Measure AI Agent Quality.

## Production Monitoring: Five Metrics That Predict Customer Satisfaction
Testing catches problems before launch. Monitoring catches them after. No persona set covers every real-world conversation, and customer behavior at scale always surprises you. Someone will ask for a product category you never considered. Someone will phrase a request in a way that bypasses your search entirely. Standard infrastructure metrics (uptime, latency, error rates) are table stakes. These five behavioral metrics predict customer experience problems before they show up in reviews:
| Metric | Healthy Range | Alert Threshold | What It Means |
|---|---|---|---|
| Recommendation acceptance | >30% | <20% | Customers aren't clicking. Your relevance model needs work. |
| Conversation depth | 3-7 turns | >10 avg turns | Agent is looping or missing intent. |
| Recovery rate | >40% | <25% | Agent can't adapt after rejection. Fix your prompts. |
| Memory hit rate | >60% | <40% | Memory integration is broken. Debug the retrieval pipeline. |
| Fallback rate | <10% | >15% | Catalog has gaps. Investigate by category. |
Recommendation acceptance is the most direct signal. If customers consistently ignore what the agent suggests, the relevance model is not working regardless of what your persona test scores say. Check whether your product data (images, descriptions, pricing) is complete. A product card without an image gets skipped.
Conversation depth acts as a canary. Extremely short conversations (1-2 turns) mean the agent is not engaging. Extremely long ones (10+ turns) mean it is not converging on useful suggestions. The sweet spot is 3-7 turns: enough to understand intent, not so many that the customer gives up.
Recovery rate measures resilience. When a customer says "no, not that," does the next suggestion lead somewhere productive? Low recovery rates mean the agent does not adapt to feedback. This is almost always a prompt problem.
Memory hit rate matters most for return customers. If your knowledge base has customer data but the agent is not referencing it, the memory integration needs debugging. This is the metric that separates a generic chatbot from a personal shopping assistant.
Fallback rate exposes catalog gaps. Every "I couldn't find anything matching your request" is a missed sale. Track fallbacks by product category to find exactly where the holes are.
Track these daily with analytics dashboards. A single bad day is noise. A week-long trend is a signal.
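That noise-versus-signal rule is easy to encode: alert only when a metric stays past its threshold for a full window. A sketch (the seven-day window and `sustainedBreach` helper are illustrative, not part of any Chanl API):

```typescript
// Alert only when a metric breaches its threshold for N consecutive days.
function sustainedBreach(
  dailyValues: number[],        // one reading per day, oldest first
  threshold: number,
  direction: "below" | "above", // e.g. acceptance alerts below 20, depth above 10
  windowDays = 7,
): boolean {
  if (dailyValues.length < windowDays) return false;
  const recent = dailyValues.slice(-windowDays);
  return recent.every((v) => (direction === "below" ? v < threshold : v > threshold));
}

// Recommendation acceptance (%): one bad day (18) is noise...
const noisy = sustainedBreach([31, 33, 18, 30, 32, 31, 30], 20, "below"); // -> false
// ...a full week under 20% is a signal worth paging on.
const sustained = sustainedBreach([19, 18, 17, 19, 16, 18, 15], 20, "below"); // -> true
```

The same helper works for both directions: fallback rate alerts "above" 15, memory hit rate alerts "below" 40.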
More on monitoring strategy: Real-Time Monitoring for AI Agents.
## The Complete Stack
Here's how the full architecture fits together across all three parts.
Three articles ago, the AI returned text. Now it renders product cards from real catalogs, remembers the customer across visits, and gets tested by AI shoppers before a single real customer sees it.
The widget layer from Part 1 makes recommendations visible. The intelligence layer from Part 2 makes them relevant. The testing and monitoring layer from this article makes them reliable. Each layer reinforces the others. Beautiful product cards do not matter if the recommendations are poor. Smart recommendations do not matter if you cannot verify them across customer segments. And testing before launch does not matter if you are not watching what happens after.
## Build smarter shopping agents
Chanl gives AI shopping assistants the backend they need: product knowledge, customer memory, MCP tools, and testing. The rendering is yours. The intelligence is handled.