A customer calls. She wants a refund for a damaged product, a replacement shipped overnight, and a callback scheduled with a manager to discuss her account. Three tasks. Three different backend systems. One conversation that needs to feel seamless.
This is the moment single-agent architectures break. Not because the model is dumb, but because one agent trying to hold refund policies, inventory queries, and scheduling logic simultaneously starts dropping context by turn four. The refund amount is wrong. The replacement ships to the old address. The callback never gets scheduled.
Gartner saw a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. By end of 2026, 40% of enterprise applications will embed task-specific AI agents, up from less than 5% in 2025. The question has shifted from "should we use multiple agents?" to "which orchestration pattern won't collapse under production traffic?"
This article answers that question. We'll follow our customer's three-part request through every major orchestration pattern, show you exactly where each one breaks, and land on the architecture that's actually surviving in production systems today.
Table of Contents
- The Patterns at a Glance
- Flat Routing: Fast but Fragile
- Sequential Pipeline: Predictable but Slow
- Hierarchical: Production's Favorite
- Plan-and-Execute: The Cost Killer
- The Passing Ships Problem
- Framework Comparison for Production
- What Actually Breaks at Scale
- The Pattern Decision Tree
The Patterns at a Glance
Before diving in, here's the landscape. Each pattern handles our customer's three-part request differently.
| Pattern | How It Routes | Latency | Cost | Debuggability | Best For |
|---|---|---|---|---|---|
| Flat routing | Classifier picks one specialist | Low | Low | Hard | Single-intent requests |
| Sequential pipeline | Agent A then B then C | High | Medium | Easy | Dependent steps |
| Hierarchical | Orchestrator delegates dynamically | Medium | Medium | Best | Complex multi-step requests |
| Plan-and-execute | Expensive planner, cheap executors | Medium | Lowest | Good | Cost-sensitive at scale |
If you're new to multi-agent systems, start with our guide on building an agent orchestrator from scratch. It covers the fundamentals. This article picks up where that leaves off: what happens when those patterns meet real traffic.
Flat Routing: Fast but Fragile
The simplest multi-agent pattern. A classifier looks at the customer's message and routes it to one specialist agent.
```typescript
// Flat routing -- fast, but can only handle one intent per message
async function routeToSpecialist(message: string) {
  // Cheap model classifies intent (costs ~$0.001 per call)
  const intent = await classify(message, {
    model: "gpt-4o-mini",
    categories: ["refund", "replacement", "scheduling", "general"],
  });
  // Route to the one specialist that matches
  return specialists[intent].handle(message);
}
```

Where it works. Single-intent messages. "I want a refund" goes to the refund agent. Fast, cheap, done.
Where it breaks. Our customer said three things in one message. The classifier picks "refund" because it appears first. The replacement and callback requests vanish. She repeats herself. The agent apologizes and handles the replacement. The callback still never happens.
This is the most common production failure mode for flat routing: multi-intent messages get truncated to single intents. Research from Maxim AI found that specification failures, where the system misunderstands what the user actually needs, account for approximately 42% of multi-agent failures.
Flat routing works for chatbots that handle one question at a time. It does not work for customer service, where real conversations are messy, multi-part, and context-dependent.
Sequential Pipeline: Predictable but Slow
Pipeline the request through specialists in order. Each agent's output feeds into the next.
```typescript
// Sequential pipeline -- predictable, but every step waits for the last
async function sequentialPipeline(request: CustomerRequest) {
  // Step 1: Process refund (2-3 seconds)
  const refundResult = await refundAgent.handle(request);

  // Step 2: Order replacement -- needs refund context to avoid double-charging
  const replacementResult = await replacementAgent.handle({
    ...request,
    refundConfirmation: refundResult,
  });

  // Step 3: Schedule callback -- needs both prior results for summary
  const callbackResult = await schedulingAgent.handle({
    ...request,
    refundConfirmation: refundResult,
    replacementConfirmation: replacementResult,
  });

  return mergeResults(refundResult, replacementResult, callbackResult);
}
```

Where it works. When steps genuinely depend on each other. The replacement agent needs to know the refund was processed before shipping (to avoid double-charging). The scheduling agent needs both confirmations to summarize what happened.
Where it breaks. Our customer is still waiting. Latency compounds. Three agents running sequentially means 6-9 seconds of wall-clock time. On a voice call, that's dead air. On chat, she's already typed "hello?" and "are you there?"
Worse, the pipeline assumes a fixed order. What if the replacement is out of stock? The scheduling agent still runs, scheduling a callback about a replacement that won't arrive. The pipeline has no way to adapt.
Sequential pipelines are great for batch processing. For real-time customer conversations, the rigidity is a liability.
Hierarchical: Production's Favorite
Here's the pattern that actually survives production. An orchestrator agent receives the full request, decomposes it into subtasks, delegates each to a specialist, and merges the results.
```typescript
// Hierarchical orchestration -- the orchestrator owns the full lifecycle
async function orchestrate(message: string, context: ConversationContext) {
  // Step 1: Orchestrator decomposes into subtasks
  // Uses a capable model because decomposition is the hardest part
  const plan = await orchestrator.decompose(message, {
    model: "claude-sonnet-4-20250514",
    availableAgents: ["refund", "replacement", "scheduling"],
    conversationHistory: context.history,
  });

  // Step 2: Execute subtasks (parallel when independent, sequential when dependent)
  const results: Record<string, unknown> = {};
  let pending = plan.steps;
  while (pending.length > 0) {
    // A step is ready once its dependency (if any) has completed
    const ready = pending.filter((s) => !s.dependsOn || s.dependsOn in results);
    if (ready.length === 0) throw new Error("Unresolvable dependency in plan");

    // Ready steps are independent of each other, so run them in parallel
    await Promise.all(
      ready.map(async (step) => {
        // Delegate to specialist with only the context it needs
        results[step.id] = await specialists[step.agent].handle({
          task: step.description,
          context: step.dependsOn ? results[step.dependsOn] : null,
        });
      }),
    );
    pending = pending.filter((s) => !ready.includes(s));
  }

  // Step 3: Orchestrator merges results into one coherent response
  return orchestrator.synthesize(results, context);
}
```

Why this wins in production. Three properties that the other patterns lack:
Clear accountability. The orchestrator owns the outcome. When the replacement is out of stock, the orchestrator catches it and adapts, maybe offering a store credit instead and updating the callback topic to match. No rigid pipeline to derail.
Debuggable traces. Every delegation is a logged event: orchestrator decided to send subtask X to agent Y with context Z. When something goes wrong at 3am, you can replay the exact decision chain. GitHub's engineering team found that treating agents like distributed system components, with typed handoffs and explicit state contracts, is the key to reliability.
Graceful degradation. If the scheduling service is down, the orchestrator handles the refund and replacement, then tells our customer: "I've processed your refund and replacement. Our scheduling system is temporarily unavailable. I'll have someone call you within 2 hours." Two out of three requests handled. That's a partial success, not a total failure.
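One way to get that partial-success behavior is to run independent subtasks through `Promise.allSettled`, which never rejects as a whole. A minimal sketch, with illustrative types (`Specialist`, `SubtaskOutcome` are not from any framework):

```typescript
// Graceful degradation sketch: one downed service cannot take the
// other subtasks with it -- failures become reportable outcomes.
type Specialist = { id: string; handle: (task: string) => Promise<string> };

type SubtaskOutcome =
  | { id: string; ok: true; result: string }
  | { id: string; ok: false; error: string };

async function runWithFallback(
  specialists: Specialist[],
  task: string,
): Promise<SubtaskOutcome[]> {
  // allSettled resolves even when individual promises reject
  const settled = await Promise.allSettled(
    specialists.map((s) => s.handle(task)),
  );
  return settled.map((outcome, i): SubtaskOutcome =>
    outcome.status === "fulfilled"
      ? { id: specialists[i].id, ok: true, result: outcome.value }
      : { id: specialists[i].id, ok: false, error: String(outcome.reason) },
  );
}
```

The orchestrator can then synthesize a response from the `ok: true` outcomes and acknowledge the failures explicitly, exactly as in the scheduling example above.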
Microsoft's Azure Architecture Center documents this as the recommended pattern for production agent systems, with the orchestrator maintaining state and the specialists remaining stateless and focused.
Plan-and-Execute: The Cost Killer
Here's where it gets interesting. Hierarchical orchestration works, but it's expensive. The orchestrator runs a capable model for every customer interaction. At scale, those tokens add up.
Plan-and-execute splits the architecture into two tiers: an expensive model that thinks, and cheap models that do.
```typescript
// Plan-and-execute -- expensive model plans, cheap models execute
async function planAndExecute(message: string, context: ConversationContext) {
  // PLANNER: Claude Sonnet decomposes the request ($0.003/1K input tokens)
  // This is the only expensive call -- it happens once per request
  const plan = await planner.createPlan(message, {
    model: "claude-sonnet-4-20250514",
    availableTools: ["process_refund", "check_inventory", "order_replacement",
      "schedule_callback", "lookup_customer"],
  });

  // EXECUTOR: GPT-4o-mini runs each step ($0.00015/1K input tokens)
  // 20x cheaper per token -- and most steps are simple tool calls
  const results = {};
  for (const step of plan.steps) {
    results[step.id] = await executor.run(step, {
      model: "gpt-4o-mini", // Classification and tool calls don't need Sonnet
      tools: step.requiredTools,
      priorResults: step.dependencies.map((d) => results[d]),
    });
  }

  // SYNTHESIZER: Back to Sonnet for the customer-facing response
  // Merging three results into natural language needs the bigger model
  return synthesizer.compose(results, {
    model: "claude-sonnet-4-20250514",
    tone: context.customerSentiment,
  });
}
```

The math. Our customer's request generates roughly 3,000 tokens across the three specialist steps. Running everything on Claude Sonnet: ~$0.009. Running the plan-and-execute split with GPT-4o-mini for execution: ~$0.0015. That's an 83% cost reduction on a single request. At 100,000 daily customer interactions, that's the difference between $900/day and $150/day.
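That arithmetic can be sketched as a toy cost model. The per-1K-token rates are the illustrative prices quoted in this article, not live pricing:

```typescript
// Toy cost model for Sonnet-everywhere vs. the plan-and-execute split.
// Rates are illustrative per-1K-input-token prices, not live pricing.
const PRICE_PER_1K = { sonnet: 0.003, mini: 0.00015 };

// Baseline: every token goes through the capable model
function costAllSonnet(totalTokens: number): number {
  return (totalTokens / 1000) * PRICE_PER_1K.sonnet;
}

// Split: planner and synthesizer on Sonnet, execution steps on the cheap model
function costSplit(
  plannerTokens: number,
  executorTokens: number,
  synthTokens: number,
): number {
  return (
    ((plannerTokens + synthTokens) / 1000) * PRICE_PER_1K.sonnet +
    (executorTokens / 1000) * PRICE_PER_1K.mini
  );
}
```

With roughly 3,000 execution tokens plus a few hundred planner and synthesizer tokens, the split lands in the same ballpark as the ~$0.0015 figure: the cheap executor tokens become almost free, and only the small planning and synthesis calls pay Sonnet rates.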
The insight is that most execution steps are simple: call an API, extract a field, classify a status. These tasks don't need frontier-model reasoning. A routing classification call costs ~$0.0025 and can redirect 30% of tasks to models that are 90% cheaper. The planner is the only step that needs to understand the full complexity of the request.
Conventional wisdom says you should use your best model everywhere for quality. The data says most execution steps are so simple that a model 20x cheaper handles them identically. Save the expensive reasoning for the one step that actually needs it.
When to use it. Plan-and-execute shines when you have high request volume (the savings compound), most execution steps are tool calls or structured extraction, and your quality requirements are met by smaller models for individual steps. If every step requires nuanced reasoning, the cost savings evaporate because you can't downgrade the executor model.
The Passing Ships Problem
Every multi-agent pattern shares one insidious failure mode. We call it the "passing ships" problem, and it's the reason most teams hit a wall around month three of production.
Here's how it manifests. Our customer's refund agent processes a $47.99 refund. The replacement agent, running in parallel, checks inventory and finds the item is discontinued. It substitutes a similar product at $52.99. The scheduling agent books a callback for "replacement follow-up."
From each agent's perspective, the job is done. From the customer's perspective, she was charged $5.00 extra without being asked, and the callback is about the wrong thing. The agents were ships passing in the night, each doing its job correctly in isolation, collectively producing a broken experience.
```typescript
// THE PROBLEM: Agents can't see each other's decisions
// Each agent operates on its own snapshot of reality
//
// Refund agent sees:      customer wants $47.99 back ✓
// Replacement agent sees: item discontinued, substitute available ✓
// Scheduling agent sees:  customer wants callback about replacement ✓
//
// Nobody sees: the substitute costs more, and the callback topic is now wrong

// THE FIX: Shared scratchpad with real-time writes
const scratchpad = new SharedState(requestId);

async function executeWithSharedState(agent, task) {
  // Agent reads latest state before starting -- sees what others have done
  const currentState = await scratchpad.read();
  const result = await agent.handle({
    task,
    sharedContext: currentState, // Full picture, not just its own slice
  });
  // Agent writes result back -- other agents see it immediately
  await scratchpad.write(agent.id, result);
  return result;
}
```

The fix is structural, not algorithmic. Every agent reads from and writes to a shared scratchpad before and after execution. The orchestrator checks for conflicts before merging results. If the replacement agent changes the product, the orchestrator re-runs the refund calculation and updates the callback topic.
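The pre-merge conflict check can be sketched as a pure function over the scratchpad. The field names (`refundAmount`, `substitutePrice`, `callbackTopic`) are hypothetical, chosen to match the running example; a real system would diff whatever fields its agents actually share:

```typescript
// Pre-merge conflict detection sketch over a shared scratchpad snapshot.
// Field names are illustrative, matching the running customer example.
type Scratchpad = {
  refundAmount?: number;     // written by the refund agent
  substitutePrice?: number;  // written by the replacement agent
  callbackTopic?: string;    // written by the scheduling agent
};

function detectConflicts(state: Scratchpad): string[] {
  const conflicts: string[] = [];

  // Substitute costs more than the refund -> customer silently pays the gap
  if (
    state.refundAmount !== undefined &&
    state.substitutePrice !== undefined &&
    state.substitutePrice > state.refundAmount
  ) {
    conflicts.push("substitute_price_exceeds_refund");
  }

  // A substitution happened after the callback topic was set -> topic is stale
  if (
    state.substitutePrice !== undefined &&
    state.callbackTopic === "replacement follow-up"
  ) {
    conflicts.push("callback_topic_stale");
  }

  return conflicts;
}
```

The orchestrator runs this check before synthesizing the final response; any non-empty result triggers a re-run of the affected subtasks rather than a merge.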
GitHub's multi-agent reliability analysis found that most failures trace back to missing structural components: shared state, ordering assumptions, and implicit handoffs. The agents aren't broken. The connections between them are.
This is also where monitoring and observability become non-negotiable. You need to see the full trace across all agents, not just individual agent logs, to catch these cross-agent coordination failures before customers do.
Framework Comparison for Production
You've picked an orchestration pattern. Now you need to implement it. Here's how the major frameworks compare for production multi-agent systems in 2026.
| Framework | Orchestration Style | Production Readiness | Best For | Watch Out For |
|---|---|---|---|---|
| LangGraph | Graph-based state machines | Most battle-tested | Complex branching, rollback, deterministic execution | Steep learning curve (graph theory required) |
| CrewAI | Role-based teams | Good, less mature monitoring | Rapid prototyping, team-based workflows | 40% faster to deploy but harder to debug at scale |
| AutoGen | Conversational agents | Production-ready (maintenance mode) | Multi-party dialogues, consensus | Microsoft shifted focus to Agent Framework |
| OpenAI Agents SDK | Built-in handoffs | Growing ecosystem | OpenAI-native stacks | Vendor lock-in to OpenAI models |
| Microsoft Agent Framework | Enterprise orchestration | New, actively developed | Azure-native enterprise | Early stage, API surface still evolving |
The honest recommendation. If you need production reliability today, LangGraph gives you the most control over state, branching, and error recovery. If you need to ship a prototype in a week, CrewAI's role-based abstraction gets you there fastest. If you're building on Azure, Microsoft's Agent Framework is the natural fit but expect to be an early adopter.
If your system is 2-4 agents with a clear workflow, you might not need a framework at all. A 150-line orchestrator with explicit handoffs is easier to debug than any framework's abstractions.
What Actually Breaks at Scale
We've covered the patterns. Here's what production teaches you that documentation doesn't.
Cascading hallucinations. Agent A hallucinates a policy ("refunds over $100 require manager approval"). Agent B, receiving this as context, treats it as fact and escalates unnecessarily. Agent C schedules a manager callback that shouldn't exist. One hallucination, three agents deep, creates a customer experience that's confidently wrong at every step.
The fix: Each agent validates its inputs against ground truth. The refund agent checks the actual refund policy, not what the orchestrator summarized. Tool integrations that connect agents to authoritative data sources prevent agents from operating on stale or fabricated context.
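A sketch of that validation step, where `fetchPolicy` is a stand-in for whatever authoritative policy lookup your system exposes (the names and shapes here are assumptions, not a real API):

```typescript
// Ground-truth validation sketch: never act on a policy claim passed in
// by another agent -- re-check it against the authoritative source.
type RefundPolicy = { managerApprovalThreshold: number };

async function requiresManagerApproval(
  amount: number,
  claimedThreshold: number, // what the upstream agent asserted
  fetchPolicy: () => Promise<RefundPolicy>, // authoritative lookup (stand-in)
): Promise<boolean> {
  const policy = await fetchPolicy();
  if (policy.managerApprovalThreshold !== claimedThreshold) {
    // The upstream claim was wrong (or hallucinated); log it and
    // proceed with the authoritative value, not the claim
    console.warn(
      `Claimed threshold ${claimedThreshold} != actual ${policy.managerApprovalThreshold}`,
    );
  }
  return amount > policy.managerApprovalThreshold;
}
```

The key design choice: the authoritative value always wins, and the disagreement itself is logged as a signal that an upstream agent is fabricating context.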
State drift under concurrency. Two agents read the customer's account balance at the same time. Both proceed as if the balance is $100. One issues a $47.99 refund, the other places a $52.99 order. The account goes negative. This is the distributed systems version of a race condition, and it's endemic to parallel agent execution.
The fix: Optimistic locking on shared state. Agents claim resources before modifying them. The orchestrator detects conflicts and retries.
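A minimal optimistic-lock sketch: every write carries the version the writer read, and a version mismatch means another agent modified the state in between, so the write is rejected and the caller must re-read and retry:

```typescript
// Optimistic locking sketch for shared agent state.
// A stale version number on write means a conflict -- the caller retries.
class VersionedState<T> {
  private version = 0;
  private value: T;

  constructor(initial: T) {
    this.value = initial;
  }

  read(): { value: T; version: number } {
    return { value: this.value, version: this.version };
  }

  // Returns false on conflict -- the caller must re-read and retry
  tryWrite(newValue: T, expectedVersion: number): boolean {
    if (expectedVersion !== this.version) return false; // someone got there first
    this.value = newValue;
    this.version++;
    return true;
  }
}
```

In the balance example above, both agents read version 0; the first write succeeds and bumps the version, the second write fails its version check, re-reads the now-reduced balance, and discovers the account can't cover the order.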
Context window exhaustion. A five-agent system where each agent passes its full output to the next agent hits context limits by agent three. The later agents are operating on truncated input, missing critical details from the early agents.
The fix: Structured handoffs. Each agent produces a summary (50-100 tokens) alongside its full output. Downstream agents receive summaries by default, with the option to request full context for specific fields.
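The handoff shape might look like this, with summaries forwarded by default and full output fetched only when explicitly requested (the field names are illustrative):

```typescript
// Structured handoff sketch: summary travels by default, full output on demand.
type Handoff = {
  agentId: string;
  summary: string;    // ~50-100 tokens, always forwarded downstream
  fullOutput: string; // stored, included only when explicitly requested
};

function buildDownstreamContext(
  handoffs: Handoff[],
  needFullFrom: string[] = [], // agents whose full output this step requires
): string {
  return handoffs
    .map((h) =>
      needFullFrom.includes(h.agentId)
        ? `${h.agentId}: ${h.fullOutput}`
        : `${h.agentId}: ${h.summary}`,
    )
    .join("\n");
}
```

Context stays roughly constant per hop instead of growing with the number of upstream agents, which is what keeps agent five inside the window.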
Evaluation blind spots. You test each agent individually. They all pass. The system fails in production because nobody tested the handoffs. Scenario testing with AI personas that simulate multi-part customer requests is the only reliable way to catch cross-agent failures before production.
The Pattern Decision Tree
Start here when choosing an orchestration pattern:
- Requests are single-intent? Flat routing is fast and cheap.
- Steps are fixed and genuinely dependent? A sequential pipeline is predictable.
- Requests are multi-part and the workflow varies? Hierarchical orchestration.
- Volume is high and most steps are simple tool calls? Layer plan-and-execute on top.
For most customer-facing production systems, the answer is hierarchical with plan-and-execute optimization. The orchestrator handles decomposition and conflict resolution using a capable model. Individual specialists execute using the cheapest model that can handle their specific task. Memory ensures context persists across the full interaction, so no agent starts from zero.
The autonomous AI agent market is projected to reach $8.5 billion by end of 2026, with Deloitte noting that enterprises that orchestrate agents well could push that figure 15-30% higher. But Gartner also warns that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs and inadequate risk controls.
The difference between the projects that survive and the ones that get canceled? The surviving ones pick a pattern that matches their actual complexity, build observability in from day one, and resist the temptation to add agents when better prompts or tools would solve the problem.
Our customer got her refund, her replacement, and her callback. Three agents, one orchestrator, and a shared scratchpad that kept them all on the same page. That's not a demo. That's Tuesday.
Build multi-agent systems with shared tools, memory, and monitoring
Chanl gives every agent access to the same tools, knowledge base, and persistent memory -- then monitors the full orchestration trace across all agents in production.
Start building

Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.