A customer calls. She wants a refund for a damaged product, a replacement shipped overnight, and a callback scheduled with a manager to discuss her account. Three tasks. Three different backend systems. One conversation that needs to feel seamless.
This is the moment single-agent architectures break. Not because the model is dumb, but because one agent trying to hold refund policies, inventory queries, and scheduling logic simultaneously starts dropping context by turn four. The refund amount is wrong. The replacement ships to the old address. The callback never gets scheduled.
Gartner saw a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. By end of 2026, 40% of enterprise applications will embed task-specific AI agents, up from less than 5% in 2025. The question has shifted from "should we use multiple agents?" to "which orchestration pattern won't collapse under production traffic?"
This article answers that question. We'll follow our customer's three-part request through every major orchestration pattern, show you exactly where each one breaks, and land on the architecture that's actually surviving in production systems today.
Table of Contents
- The Patterns at a Glance
- Flat Routing: Fast but Fragile
- Sequential Pipeline: Predictable but Slow
- Hierarchical: Production's Favorite
- Plan-and-Execute: The Cost Killer
- The Passing Ships Problem
- Framework Comparison for Production
- What Actually Breaks at Scale
- The Pattern Decision Tree
The Patterns at a Glance
Before diving in, here's the landscape. Each pattern handles our customer's three-part request differently.
| Pattern | How It Routes | Latency | Cost | Debuggability | Best For |
|---|---|---|---|---|---|
| Flat routing | Classifier picks one specialist | Low | Low | Hard | Single-intent requests |
| Sequential pipeline | Agent A then B then C | High | Medium | Easy | Dependent steps |
| Hierarchical | Orchestrator delegates dynamically | Medium | Medium | Best | Complex multi-step requests |
| Plan-and-execute | Expensive planner, cheap executors | Medium | Lowest | Good | Cost-sensitive at scale |
If you're new to multi-agent systems, start with our guide on building an agent orchestrator from scratch. It covers the fundamentals. This article picks up where that leaves off: what happens when those patterns meet real traffic.
Flat Routing: Fast but Fragile
The simplest multi-agent pattern. A classifier looks at the customer's message and routes it to one specialist agent.
```typescript
// Flat routing -- fast, but can only handle one intent per message
async function routeToSpecialist(message: string) {
  // Cheap model classifies intent (costs ~$0.001 per call)
  const intent = await classify(message, {
    model: "gpt-4o-mini",
    categories: ["refund", "replacement", "scheduling", "general"],
  });
  // Route to the one specialist that matches
  return specialists[intent].handle(message);
}
```

Where it works. Single-intent messages. "I want a refund" goes to the refund agent. Fast, cheap, done.
Where it breaks. Our customer said three things in one message. The classifier picks "refund" because it appears first. The replacement and callback requests vanish. She repeats herself. The agent apologizes and handles the replacement. The callback still never happens.
This is the most common production failure mode for flat routing: multi-intent messages get truncated to single intents. Research from Maxim AI found that specification failures, where the system misunderstands what the user actually needs, account for approximately 42% of multi-agent failures.
Flat routing works for chatbots that handle one question at a time. It does not work for customer service, where real conversations are messy, multi-part, and context-dependent.
Sequential Pipeline: Predictable but Slow
Pipeline the request through specialists in order. Each agent's output feeds into the next.
```typescript
// Sequential pipeline -- predictable, but every step waits for the last
async function sequentialPipeline(request: CustomerRequest) {
  // Step 1: Process refund (2-3 seconds)
  const refundResult = await refundAgent.handle(request);

  // Step 2: Order replacement -- needs refund context to avoid double-charging
  const replacementResult = await replacementAgent.handle({
    ...request,
    refundConfirmation: refundResult,
  });

  // Step 3: Schedule callback -- needs both prior results for summary
  const callbackResult = await schedulingAgent.handle({
    ...request,
    refundConfirmation: refundResult,
    replacementConfirmation: replacementResult,
  });

  return mergeResults(refundResult, replacementResult, callbackResult);
}
```

Where it works. When steps genuinely depend on each other. The replacement agent needs to know the refund was processed before shipping (to avoid double-charging). The scheduling agent needs both confirmations to summarize what happened.
Where it breaks. Our customer is still waiting. Latency compounds. Three agents running sequentially means 6-9 seconds of wall-clock time. On a voice call, that's dead air. On chat, she's already typed "hello?" and "are you there?"
Worse, the pipeline assumes a fixed order. What if the replacement is out of stock? The scheduling agent still runs, scheduling a callback about a replacement that won't arrive. The pipeline has no way to adapt.
Sequential pipelines are great for batch processing. For real-time customer conversations, the rigidity is a liability.
Hierarchical: Production's Favorite
Here's the pattern that actually survives production. An orchestrator agent receives the full request, decomposes it into subtasks, delegates each to a specialist, and merges the results.
```typescript
// Hierarchical orchestration -- the orchestrator owns the full lifecycle
async function orchestrate(message: string, context: ConversationContext) {
  // Step 1: Orchestrator decomposes into subtasks
  // Uses a capable model because decomposition is the hardest part
  const plan = await orchestrator.decompose(message, {
    model: "claude-sonnet-4-20250514",
    availableAgents: ["refund", "replacement", "scheduling"],
    conversationHistory: context.history,
  });

  // Step 2: Execute subtasks (parallel when independent, sequential when dependent)
  const results: Record<string, unknown> = {};
  let pending = plan.steps;
  while (pending.length > 0) {
    // A step is ready once its dependency (if any) has completed
    const ready = pending.filter((s) => !s.dependsOn || s.dependsOn in results);
    if (ready.length === 0) throw new Error("Unresolvable dependency in plan");

    // Ready steps are independent of each other, so run them in parallel
    await Promise.all(
      ready.map(async (step) => {
        // Delegate to specialist with only the context it needs
        results[step.id] = await specialists[step.agent].handle({
          task: step.description,
          context: step.dependsOn ? results[step.dependsOn] : null,
        });
      }),
    );
    pending = pending.filter((s) => !ready.includes(s));
  }

  // Step 3: Orchestrator merges results into one coherent response
  return orchestrator.synthesize(results, context);
}
```

Why this wins in production. Three properties that the other patterns lack:
Clear accountability. The orchestrator owns the outcome. When the replacement is out of stock, the orchestrator catches it and adapts, maybe offering a store credit instead and updating the callback topic to match. No rigid pipeline to derail.
Debuggable traces. Every delegation is a logged event: orchestrator decided to send subtask X to agent Y with context Z. When something goes wrong at 3am, you can replay the exact decision chain. GitHub's engineering team found that treating agents like distributed system components, with typed handoffs and explicit state contracts, is the key to reliability.
Graceful degradation. If the scheduling service is down, the orchestrator handles the refund and replacement, then tells our customer: "I've processed your refund and replacement. Our scheduling system is temporarily unavailable. I'll have someone call you within 2 hours." Two out of three requests handled. That's a partial success, not a total failure.
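One way to get that partial-success behavior is to run independent subtasks through `Promise.allSettled`, which never rejects as a whole. A minimal sketch, with illustrative types (`Specialist`, `SubtaskOutcome` are not from any framework):

```typescript
// Graceful degradation sketch: one downed service cannot take the
// other subtasks with it -- failures become reportable outcomes.
type Specialist = { id: string; handle: (task: string) => Promise<string> };

type SubtaskOutcome =
  | { id: string; ok: true; result: string }
  | { id: string; ok: false; error: string };

async function runWithFallback(
  specialists: Specialist[],
  task: string,
): Promise<SubtaskOutcome[]> {
  // allSettled resolves even when individual promises reject
  const settled = await Promise.allSettled(
    specialists.map((s) => s.handle(task)),
  );
  return settled.map((outcome, i): SubtaskOutcome =>
    outcome.status === "fulfilled"
      ? { id: specialists[i].id, ok: true, result: outcome.value }
      : { id: specialists[i].id, ok: false, error: String(outcome.reason) },
  );
}
```

The orchestrator can then synthesize a response from the `ok: true` outcomes and acknowledge the failures explicitly, exactly as in the scheduling example above.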
Microsoft's Azure Architecture Center documents this as the recommended pattern for production agent systems, with the orchestrator maintaining state and the specialists remaining stateless and focused.
Plan-and-Execute: The Cost Killer
Here's where it gets interesting. Hierarchical orchestration works, but it's expensive. The orchestrator runs a capable model for every customer interaction. At scale, those tokens add up.
Plan-and-execute splits the architecture into two tiers: an expensive model that thinks, and cheap models that do.
```typescript
// Plan-and-execute -- expensive model plans, cheap models execute
async function planAndExecute(message: string, context: ConversationContext) {
  // PLANNER: Claude Sonnet decomposes the request ($0.003/1K input tokens)
  // This is the only expensive call -- it happens once per request
  const plan = await planner.createPlan(message, {
    model: "claude-sonnet-4-20250514",
    availableTools: ["process_refund", "check_inventory", "order_replacement",
      "schedule_callback", "lookup_customer"],
  });

  // EXECUTOR: GPT-4o-mini runs each step ($0.00015/1K input tokens)
  // 20x cheaper per token -- and most steps are simple tool calls
  const results = {};
  for (const step of plan.steps) {
    results[step.id] = await executor.run(step, {
      model: "gpt-4o-mini", // Classification and tool calls don't need Sonnet
      tools: step.requiredTools,
      priorResults: step.dependencies.map((d) => results[d]),
    });
  }

  // SYNTHESIZER: Back to Sonnet for the customer-facing response
  // Merging three results into natural language needs the bigger model
  return synthesizer.compose(results, {
    model: "claude-sonnet-4-20250514",
    tone: context.customerSentiment,
  });
}
```

The math. Our customer's request generates roughly 3,000 tokens across the three specialist steps. Running everything on Claude Sonnet: ~$0.009. Running the plan-and-execute split with GPT-4o-mini for execution: ~$0.0015. That's an 83% cost reduction on a single request. At 100,000 daily customer interactions, that's the difference between $900/day and $150/day.
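That arithmetic can be sketched as a toy cost model. The per-1K-token rates are the illustrative prices quoted in this article, not live pricing:

```typescript
// Toy cost model for Sonnet-everywhere vs. the plan-and-execute split.
// Rates are illustrative per-1K-input-token prices, not live pricing.
const PRICE_PER_1K = { sonnet: 0.003, mini: 0.00015 };

// Baseline: every token goes through the capable model
function costAllSonnet(totalTokens: number): number {
  return (totalTokens / 1000) * PRICE_PER_1K.sonnet;
}

// Split: planner and synthesizer on Sonnet, execution steps on the cheap model
function costSplit(
  plannerTokens: number,
  executorTokens: number,
  synthTokens: number,
): number {
  return (
    ((plannerTokens + synthTokens) / 1000) * PRICE_PER_1K.sonnet +
    (executorTokens / 1000) * PRICE_PER_1K.mini
  );
}
```

With roughly 3,000 execution tokens plus a few hundred planner and synthesizer tokens, the split lands in the same ballpark as the ~$0.0015 figure: the cheap executor tokens become almost free, and only the small planning and synthesis calls pay Sonnet rates.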
The insight is that most execution steps are simple: call an API, extract a field, classify a status. These tasks don't need frontier-model reasoning. A routing classification call costs ~$0.0025 and can redirect 30% of tasks to models that are 90% cheaper. The planner is the only step that needs to understand the full complexity of the request.
Conventional wisdom says you should use your best model everywhere for quality. The data says most execution steps are so simple that a model 20x cheaper handles them identically. Save the expensive reasoning for the one step that actually needs it.
When to use it. Plan-and-execute shines when you have high request volume (the savings compound), most execution steps are tool calls or structured extraction, and your quality requirements are met by smaller models for individual steps. If every step requires nuanced reasoning, the cost savings evaporate because you can't downgrade the executor model.
The Passing Ships Problem
Every multi-agent pattern shares one insidious failure mode. We call it the "passing ships" problem, and it's the reason most teams hit a wall around month three of production.
Here's how it manifests. Our customer's refund agent processes a $47.99 refund. The replacement agent, running in parallel, checks inventory and finds the item is discontinued. It substitutes a similar product at $52.99. The scheduling agent books a callback for "replacement follow-up."
From each agent's perspective, the job is done. From the customer's perspective, she was charged $5.00 extra without being asked, and the callback is about the wrong thing. The agents were ships passing in the night, each doing its job correctly in isolation, collectively producing a broken experience.
```typescript
// THE PROBLEM: Agents can't see each other's decisions
// Each agent operates on its own snapshot of reality
//
// Refund agent sees:      customer wants $47.99 back ✓
// Replacement agent sees: item discontinued, substitute available ✓
// Scheduling agent sees:  customer wants callback about replacement ✓
//
// Nobody sees: the substitute costs more, and the callback topic is now wrong

// THE FIX: Shared scratchpad with real-time writes
const scratchpad = new SharedState(requestId);

async function executeWithSharedState(agent, task) {
  // Agent reads latest state before starting -- sees what others have done
  const currentState = await scratchpad.read();
  const result = await agent.handle({
    task,
    sharedContext: currentState, // Full picture, not just its own slice
  });
  // Agent writes result back -- other agents see it immediately
  await scratchpad.write(agent.id, result);
  return result;
}
```

The fix is structural, not algorithmic. Every agent reads from and writes to a shared scratchpad before and after execution. The orchestrator checks for conflicts before merging results. If the replacement agent changes the product, the orchestrator re-runs the refund calculation and updates the callback topic.
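The pre-merge conflict check can be sketched as a pure function over the scratchpad. The field names (`refundAmount`, `substitutePrice`, `callbackTopic`) are hypothetical, chosen to match the running example; a real system would diff whatever fields its agents actually share:

```typescript
// Pre-merge conflict detection sketch over a shared scratchpad snapshot.
// Field names are illustrative, matching the running customer example.
type Scratchpad = {
  refundAmount?: number;     // written by the refund agent
  substitutePrice?: number;  // written by the replacement agent
  callbackTopic?: string;    // written by the scheduling agent
};

function detectConflicts(state: Scratchpad): string[] {
  const conflicts: string[] = [];

  // Substitute costs more than the refund -> customer silently pays the gap
  if (
    state.refundAmount !== undefined &&
    state.substitutePrice !== undefined &&
    state.substitutePrice > state.refundAmount
  ) {
    conflicts.push("substitute_price_exceeds_refund");
  }

  // A substitution happened after the callback topic was set -> topic is stale
  if (
    state.substitutePrice !== undefined &&
    state.callbackTopic === "replacement follow-up"
  ) {
    conflicts.push("callback_topic_stale");
  }

  return conflicts;
}
```

The orchestrator runs this check before synthesizing the final response; any non-empty result triggers a re-run of the affected subtasks rather than a merge.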
GitHub's multi-agent reliability analysis found that most failures trace back to missing structural components: shared state, ordering assumptions, and implicit handoffs. The agents aren't broken. The connections between them are.
This is also where monitoring and observability become non-negotiable. You need to see the full trace across all agents, not just individual agent logs, to catch these cross-agent coordination failures before customers do.
Framework Comparison for Production
You've picked an orchestration pattern. Now you need to implement it. Here's how the major frameworks compare for production multi-agent systems in 2026.
| Framework | Orchestration Style | Production Readiness | Best For | Watch Out For |
|---|---|---|---|---|
| LangGraph | Graph-based state machines | Most battle-tested | Complex branching, rollback, deterministic execution | Steep learning curve (graph theory required) |
| CrewAI | Role-based teams | Good, less mature monitoring | Rapid prototyping, team-based workflows | 40% faster to deploy but harder to debug at scale |
| AutoGen | Conversational agents | Production-ready (maintenance mode) | Multi-party dialogues, consensus | Microsoft shifted focus to Agent Framework |
| OpenAI Agents SDK | Built-in handoffs | Growing ecosystem | OpenAI-native stacks | Vendor lock-in to OpenAI models |
| Microsoft Agent Framework | Enterprise orchestration | New, actively developed | Azure-native enterprise | Early stage, API surface still evolving |
The honest recommendation. If you need production reliability today, LangGraph gives you the most control over state, branching, and error recovery. If you need to ship a prototype in a week, CrewAI's role-based abstraction gets you there fastest. If you're building on Azure, Microsoft's Agent Framework is the natural fit but expect to be an early adopter.
If your system is 2-4 agents with a clear workflow, you might not need a framework at all. A 150-line orchestrator with explicit handoffs is easier to debug than any framework's abstractions.
What Actually Breaks at Scale
We've covered the patterns. Here's what production teaches you that documentation doesn't.
Cascading hallucinations. Agent A hallucinates a policy ("refunds over $100 require manager approval"). Agent B, receiving this as context, treats it as fact and escalates unnecessarily. Agent C schedules a manager callback that shouldn't exist. One hallucination, three agents deep, creates a customer experience that's confidently wrong at every step.
The fix: Each agent validates its inputs against ground truth. The refund agent checks the actual refund policy, not what the orchestrator summarized. Tool integrations that connect agents to authoritative data sources prevent agents from operating on stale or fabricated context.
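A sketch of that validation step, where `fetchPolicy` is a stand-in for whatever authoritative policy lookup your system exposes (the names and shapes here are assumptions, not a real API):

```typescript
// Ground-truth validation sketch: never act on a policy claim passed in
// by another agent -- re-check it against the authoritative source.
type RefundPolicy = { managerApprovalThreshold: number };

async function requiresManagerApproval(
  amount: number,
  claimedThreshold: number, // what the upstream agent asserted
  fetchPolicy: () => Promise<RefundPolicy>, // authoritative lookup (stand-in)
): Promise<boolean> {
  const policy = await fetchPolicy();
  if (policy.managerApprovalThreshold !== claimedThreshold) {
    // The upstream claim was wrong (or hallucinated); log it and
    // proceed with the authoritative value, not the claim
    console.warn(
      `Claimed threshold ${claimedThreshold} != actual ${policy.managerApprovalThreshold}`,
    );
  }
  return amount > policy.managerApprovalThreshold;
}
```

The key design choice: the authoritative value always wins, and the disagreement itself is logged as a signal that an upstream agent is fabricating context.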
State drift under concurrency. Two agents read the customer's account balance at the same time. Both proceed as if the balance is $100. One issues a $47.99 refund, the other places a $52.99 order. The account goes negative. This is the distributed systems version of a race condition, and it's endemic to parallel agent execution.
The fix: Optimistic locking on shared state. Agents claim resources before modifying them. The orchestrator detects conflicts and retries.
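A minimal optimistic-lock sketch: every write carries the version the writer read, and a version mismatch means another agent modified the state in between, so the write is rejected and the caller must re-read and retry:

```typescript
// Optimistic locking sketch for shared agent state.
// A stale version number on write means a conflict -- the caller retries.
class VersionedState<T> {
  private version = 0;
  private value: T;

  constructor(initial: T) {
    this.value = initial;
  }

  read(): { value: T; version: number } {
    return { value: this.value, version: this.version };
  }

  // Returns false on conflict -- the caller must re-read and retry
  tryWrite(newValue: T, expectedVersion: number): boolean {
    if (expectedVersion !== this.version) return false; // someone got there first
    this.value = newValue;
    this.version++;
    return true;
  }
}
```

In the balance example above, both agents read version 0; the first write succeeds and bumps the version, the second write fails its version check, re-reads the now-reduced balance, and discovers the account can't cover the order.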
Context window exhaustion. A five-agent system where each agent passes its full output to the next agent hits context limits by agent three. The later agents are operating on truncated input, missing critical details from the early agents.
The fix: Structured handoffs. Each agent produces a summary (50-100 tokens) alongside its full output. Downstream agents receive summaries by default, with the option to request full context for specific fields.
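The handoff shape might look like this, with summaries forwarded by default and full output fetched only when explicitly requested (the field names are illustrative):

```typescript
// Structured handoff sketch: summary travels by default, full output on demand.
type Handoff = {
  agentId: string;
  summary: string;    // ~50-100 tokens, always forwarded downstream
  fullOutput: string; // stored, included only when explicitly requested
};

function buildDownstreamContext(
  handoffs: Handoff[],
  needFullFrom: string[] = [], // agents whose full output this step requires
): string {
  return handoffs
    .map((h) =>
      needFullFrom.includes(h.agentId)
        ? `${h.agentId}: ${h.fullOutput}`
        : `${h.agentId}: ${h.summary}`,
    )
    .join("\n");
}
```

Context stays roughly constant per hop instead of growing with the number of upstream agents, which is what keeps agent five inside the window.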
Evaluation blind spots. You test each agent individually. They all pass. The system fails in production because nobody tested the handoffs. Scenario testing with AI personas that simulate multi-part customer requests is the only reliable way to catch cross-agent failures before production.
The Pattern Decision Tree
Start here when choosing an orchestration pattern:
- Requests are single-intent? Flat routing is fast and cheap.
- Steps are fixed and genuinely dependent? A sequential pipeline is predictable.
- Requests are multi-part and the workflow varies? Hierarchical orchestration.
- Volume is high and most steps are simple tool calls? Layer plan-and-execute on top.
For most customer-facing production systems, the answer is hierarchical with plan-and-execute optimization. The orchestrator handles decomposition and conflict resolution using a capable model. Individual specialists execute using the cheapest model that can handle their specific task. Memory ensures context persists across the full interaction, so no agent starts from zero.
The autonomous AI agent market is projected to reach $8.5 billion by end of 2026, with Deloitte noting that enterprises that orchestrate agents well could push that figure 15-30% higher. But Gartner also warns that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs and inadequate risk controls.
The difference between the projects that survive and the ones that get canceled? The surviving ones pick a pattern that matches their actual complexity, build observability in from day one, and resist the temptation to add agents when better prompts or tools would solve the problem.
Our customer got her refund, her replacement, and her callback. Three agents, one orchestrator, and a shared scratchpad that kept them all on the same page. That's not a demo. That's Tuesday.
Build multi-agent systems with shared tools, memory, and monitoring
Chanl gives every agent access to the same tools, knowledge base, and persistent memory -- then monitors the full orchestration trace across all agents in production.
Start building

Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.