Your customer calls in, describes water damage to their kitchen ceiling, and texts a photo of the leak. Your agent needs to hear the frustration in their voice, see the extent of the damage in the image, and read their policy documents — all within the same conversation. That's not three separate AI systems. That's one multimodal agent, and building it for production is fundamentally different from wiring together API demos.
The multimodal AI market hit $3.85 billion in 2026 and is growing at nearly 29% annually. But market size doesn't tell you how to actually build these systems. Most tutorials show you how to send an image to GPT-4o and get a text response. Production multimodal agents need to handle voice pipelines with sub-800ms latency, fuse information across modalities in real time, degrade gracefully when a camera feed drops, and do all of this at scale.
This article builds a multimodal agent architecture from the ground up. We'll start with what "multimodal" actually means for agents (not just LLMs), work through voice pipelines, vision integration, and cross-modal fusion, then assemble a production-ready architecture with real TypeScript code. Every section addresses latency, failure modes, and the tradeoffs you'll face in real deployments.
Prerequisites and Setup
You'll need Node.js 20+, TypeScript 5+, and familiarity with async/await patterns. Experience with at least one LLM API (OpenAI, Anthropic, or Google) is assumed.
npm install openai @google/generative-ai ws zod

If you're new to how AI agents use tools and external services, read AI Agent Tools: MCP, OpenAPI, and Tool Management first — multimodal agents rely heavily on tool orchestration. For prompt design fundamentals that apply across all modalities, Prompt Engineering from First Principles covers the techniques referenced throughout this article.
The code examples are self-contained TypeScript. Each snippet runs independently — no framework beyond the installed packages above.
What "Multimodal" Actually Means for Agents
A multimodal AI agent orchestrates perception, reasoning, and action across voice, vision, and text within a single conversation — it doesn't just accept multiple input types, it fuses them into coherent understanding and coordinates responses across output channels simultaneously.
That distinction matters. A multimodal LLM like GPT-4o can process images, audio, and text in one API call. That's impressive, but it's a single inference step. A multimodal agent needs to:
- Perceive — continuously ingest data from multiple channels (microphone, camera, text input) at different rates
- Synchronize — align information across modalities so "this one" spoken while pointing at an image resolves correctly
- Reason — combine cross-modal context with memory, tools, and business logic
- Act — generate responses in the appropriate modality (speak, display, write) with the right timing
Here's the architectural difference:
The perception layer handles input from each modality independently — voice arrives as audio chunks, images as frames or uploads, text as messages. The orchestration layer is where the real complexity lives: synchronizing timestamps across modalities, fusing context into a unified representation, and managing conversational state. The action layer routes responses to the right output channel.
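To make the separation concrete, here is a minimal sketch of the three layers as TypeScript contracts — the names and shapes are illustrative assumptions, not an API from any framework:

```typescript
// Illustrative layer contracts — names and shapes are assumptions, not a fixed API.

// Perception: each modality emits timestamped events at its own rate.
interface PerceptionEvent {
  modality: "voice" | "vision" | "text";
  capturedAt: number; // stamped when the user acted, not when processing finished
  payload: unknown;
}

// Orchestration: aligns events in time and maintains conversational state.
interface Orchestrator {
  ingest(event: PerceptionEvent): void;
  currentContext(): { summary: string; events: PerceptionEvent[] };
}

// Action: routes a response to the right output channel.
interface ActionRouter {
  speak(text: string): Promise<void>;
  display(imageUrl: string): Promise<void>;
  write(message: string): Promise<void>;
}

// A trivial in-memory orchestrator, just to show the contract in motion.
class InMemoryOrchestrator implements Orchestrator {
  private events: PerceptionEvent[] = [];
  ingest(event: PerceptionEvent): void {
    this.events.push(event);
  }
  currentContext() {
    const modalities = new Set(this.events.map((e) => e.modality));
    return {
      summary: `${this.events.length} event(s) across ${modalities.size} modality(ies)`,
      events: [...this.events],
    };
  }
}
```

The fusion and orchestration code later in this article is one concrete realization of these contracts.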
Why Not Just Use a Multimodal LLM Directly?
You might wonder: if GPT-4o or Gemini 2.5 can natively handle audio, images, and text, why build all this orchestration? Three reasons that become obvious in production:
Latency control. A single API call that processes audio + image + text has unpredictable latency. When you decompose the pipeline, you can stream each stage, start TTS before the full LLM response is ready, and meet the 300-500ms window that human conversation demands.
Component independence. When Deepgram ships a faster STT model, you swap one component. When your vision model hallucinates on medical documents, you switch to a specialized model for that domain. Monolithic multimodal calls lock you into one provider's strengths and weaknesses.
Observability. If your agent misidentifies damage in a photo, was it the vision model? The prompt? The fusion logic? Decomposed pipelines give you inspectable intermediate states. A single multimodal call is a black box.
The Voice Pipeline: STT, LLM, TTS
A production voice pipeline converts speech to text, runs inference, and converts the response back to speech — the entire round trip needs to complete in under 800ms to feel natural, which means every stage must stream and overlap rather than wait for the previous stage to finish.
Human conversation has a rhythm. Responses within 300-500ms feel natural. Gaps beyond 800ms feel sluggish. Anything past 1.5 seconds and callers start checking if the line dropped. This isn't a nice-to-have — it's the physics of human interaction that your architecture must respect.
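Those thresholds map directly onto a check you can run against measured round-trip times — the bucket names here are ours; the numbers are the ones above:

```typescript
// Classify a measured voice response latency against conversational thresholds.
type LatencyFeel = "natural" | "acceptable" | "sluggish" | "feels-dropped";

function classifyResponseLatency(ms: number): LatencyFeel {
  if (ms <= 500) return "natural"; // within the 300-500ms conversational rhythm
  if (ms <= 800) return "acceptable"; // noticeable but tolerable
  if (ms <= 1500) return "sluggish"; // the gap callers start to feel
  return "feels-dropped"; // past 1.5s, callers check the line
}

// classifyResponseLatency(420) → "natural"
// classifyResponseLatency(950) → "sluggish"
```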
The Cascaded Pipeline
The standard voice agent architecture chains three components: streaming speech-to-text, streaming LLM inference, and streaming text-to-speech.
The critical insight is streaming overlap. You don't wait for STT to finish before sending text to the LLM. You don't wait for the full LLM response before starting TTS. Each stage processes partial input and emits partial output, and the stages run concurrently.
Here's a TypeScript implementation of a streaming voice pipeline:
import { EventEmitter } from "events";
interface AudioChunk {
data: Buffer;
timestamp: number;
sampleRate: number;
}
interface PipelineConfig {
sttProvider: STTProvider;
llmProvider: LLMProvider;
ttsProvider: TTSProvider;
vadSensitivity: number;
interruptionEnabled: boolean;
maxResponseLatencyMs: number;
}
interface STTProvider {
streamTranscribe(
audio: AsyncIterable<AudioChunk>
): AsyncIterable<{ text: string; isFinal: boolean; confidence: number }>;
}
interface LLMProvider {
streamChat(
messages: Array<{ role: string; content: string }>,
systemPrompt: string
): AsyncIterable<{ token: string; done: boolean }>;
}
interface TTSProvider {
streamSynthesize(
text: AsyncIterable<string>
): AsyncIterable<AudioChunk>;
}
class VoicePipeline extends EventEmitter {
private config: PipelineConfig;
private conversationHistory: Array<{ role: string; content: string }> = [];
private isProcessing = false;
private abortController: AbortController | null = null;
constructor(config: PipelineConfig) {
super();
this.config = config;
}
async processUtterance(audioStream: AsyncIterable<AudioChunk>): Promise<void> {
// If already processing and interruption is enabled, abort current response
if (this.isProcessing && this.config.interruptionEnabled) {
this.abortController?.abort();
this.emit("interrupted");
}
this.isProcessing = true;
this.abortController = new AbortController();
const startTime = Date.now();
try {
// Stage 1: STT — streaming transcription
let fullTranscript = "";
const sttStream = this.config.sttProvider.streamTranscribe(audioStream);
for await (const segment of sttStream) {
if (this.abortController.signal.aborted) return;
if (segment.isFinal) {
fullTranscript += segment.text + " ";
this.emit("transcript", {
text: segment.text,
confidence: segment.confidence,
latencyMs: Date.now() - startTime,
});
}
}
if (!fullTranscript.trim()) return;
// Add user message to history
this.conversationHistory.push({
role: "user",
content: fullTranscript.trim(),
});
// Stage 2: LLM — streaming inference
const llmStream = this.config.llmProvider.streamChat(
this.conversationHistory,
"You are a helpful voice assistant. Keep responses concise and conversational."
);
// Stage 3: TTS — starts consuming LLM tokens as they arrive.
// Tee the LLM stream so fullResponse accumulates while TTS consumes it.
let fullResponse = "";
const trackedStream = (async function* () {
for await (const chunk of llmStream) {
fullResponse += chunk.token;
yield chunk;
}
})();
const tokenBuffer = this.createTokenBuffer(trackedStream);
const audioOutput = this.config.ttsProvider.streamSynthesize(tokenBuffer);
let firstAudioEmitted = false;
for await (const audioChunk of audioOutput) {
if (this.abortController.signal.aborted) return;
if (!firstAudioEmitted) {
firstAudioEmitted = true;
this.emit("firstAudio", {
totalLatencyMs: Date.now() - startTime,
});
}
this.emit("audio", audioChunk);
}
// Store assistant response
this.conversationHistory.push({
role: "assistant",
content: fullResponse,
});
this.emit("complete", {
totalLatencyMs: Date.now() - startTime,
transcript: fullTranscript.trim(),
response: fullResponse,
});
} catch (error) {
this.emit("error", { error, latencyMs: Date.now() - startTime });
} finally {
this.isProcessing = false;
}
}
private async *createTokenBuffer(
llmStream: AsyncIterable<{ token: string; done: boolean }>
): AsyncIterable<string> {
// Buffer tokens into sentence-sized chunks for natural TTS
let buffer = "";
const sentenceEnders = /[.!?]\s/;
for await (const { token, done } of llmStream) {
if (this.abortController?.signal.aborted) return;
buffer += token;
// Yield at sentence boundaries for natural speech rhythm
if (sentenceEnders.test(buffer) || done) {
if (buffer.trim()) {
yield buffer.trim();
buffer = "";
}
}
}
// Flush remaining buffer
if (buffer.trim()) {
yield buffer.trim();
}
}
}

Component-Level Latency Benchmarks
Each stage in the pipeline has a latency budget. Here's what production systems achieve in 2026:
| Component | Target | Best-in-Class | Notes |
|---|---|---|---|
| VAD (Voice Activity Detection) | <30ms | ~10ms (Silero VAD) | Runs locally, negligible |
| STT (Speech-to-Text) | <200ms | ~150ms (Deepgram Nova-3) | Streaming mode, partial results |
| LLM (Time to First Token) | <400ms | ~200ms (Groq, Fireworks) | Model and provider dependent |
| TTS (Time to First Audio) | <200ms | ~75ms (ElevenLabs Flash) | Model-only; add network latency |
| Total pipeline | <800ms | ~465ms | With optimized provider stack |
The gap between "best-in-class" and "what you'll actually get" is significant. ElevenLabs reports 75ms model latency for their Flash TTS, but real-world measurements show 350ms in the US and 527ms from India once you add network round trips. Deepgram's Nova-3 achieves 5.3% word error rate in benchmarks, but production audio with background noise and overlapping speakers can push error rates above 10%.
Plan for the realistic numbers, not the vendor claims.
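One way to keep yourself honest is to sum your measured per-stage latencies against the 800ms end-to-end target. The stage names mirror the table; the sample numbers are hypothetical field measurements, not vendor claims:

```typescript
interface StageLatency {
  stage: "vad" | "stt" | "llm_ttft" | "tts_ttfa";
  measuredMs: number;
}

// Sum per-stage latencies and compare against the end-to-end budget.
function checkLatencyBudget(
  stages: StageLatency[],
  budgetMs = 800
): { totalMs: number; withinBudget: boolean } {
  const totalMs = stages.reduce((sum, s) => sum + s.measuredMs, 0);
  return { totalMs, withinBudget: totalMs <= budgetMs };
}

// Hypothetical real-world measurements (note TTS at 350ms, not the 75ms claim):
const result = checkLatencyBudget([
  { stage: "vad", measuredMs: 15 },
  { stage: "stt", measuredMs: 180 },
  { stage: "llm_ttft", measuredMs: 350 },
  { stage: "tts_ttfa", measuredMs: 350 },
]);
// result → { totalMs: 895, withinBudget: false } — time to stream more aggressively
```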
Interruption Handling
Natural conversation includes interruptions. A customer starts explaining a problem, your agent begins responding, and the customer cuts in with "no, not that — the other account." Your pipeline needs to:
- Detect the interruption via VAD (user speaking while agent is speaking)
- Abort the current TTS output immediately
- Cancel the in-flight LLM generation
- Process the new utterance with the context that the previous response was interrupted
class InterruptionHandler {
private currentResponseId: string | null = null;
onUserSpeechDetected(pipeline: VoicePipeline): void {
if (this.currentResponseId) {
// User is speaking while agent is responding — interrupt
pipeline.emit("interrupt", {
responseId: this.currentResponseId,
reason: "user_speech_detected",
});
// Record the partial response for context. Note: getCurrentPartialResponse()
// and addToHistory() are assumed extensions to VoicePipeline, not methods
// defined in the pipeline class above.
const partialResponse = pipeline.getCurrentPartialResponse();
pipeline.addToHistory({
role: "system",
content: `[Agent was interrupted after saying: "${partialResponse}"]`,
});
}
}
}

Interruption handling is where cascaded pipelines have a clear advantage over speech-to-speech models. With a cascaded pipeline, you can abort each stage independently. With a speech-to-speech model, you're at the mercy of the provider's interruption support.
Speech-to-Speech: The Emerging Alternative
OpenAI's Realtime API and Google's Gemini 2.5 with native audio offer a different approach: skip the cascade entirely. The model ingests audio and produces audio, preserving vocal nuance and emotional tone that text-mediated pipelines lose.
The tradeoffs are real:
| Aspect | Cascaded (STT→LLM→TTS) | Speech-to-Speech |
|---|---|---|
| Latency | ~465-800ms (optimized) | ~250-300ms |
| Debuggability | High — inspect text at each stage | Low — audio in, audio out |
| Component flexibility | Swap any stage independently | Locked to one provider |
| Cost | Sum of component costs | Single model pricing |
| Voice control | Full TTS customization | Provider's voice options |
| Business logic insertion | Between any stage | Before or after, not during |
For most production systems in 2026, cascaded pipelines remain the pragmatic choice. You get observability, flexibility, and the ability to insert business logic (like compliance checks on the transcript) between stages. But speech-to-speech latency advantages are compelling for use cases where sub-300ms response time matters more than debuggability.
Vision Integration: Beyond Image Classification
Adding vision to an AI agent means processing images and documents as first-class conversational context — not just classifying what's in a photo, but understanding how visual information relates to what the customer is saying, what their records show, and what action to take next.
Vision in production agents splits into two patterns: asynchronous (user uploads an image, agent processes it) and synchronous (real-time video analysis during a conversation). Nearly every team should start with async and add sync only when the use case demands it.
Async Vision: Image Understanding in Conversations
The most common pattern: a customer sends a photo during a chat or voice call, and the agent needs to understand it in context. This works with every major vision-capable LLM — GPT-4o, Gemini 2.5, Claude.
import OpenAI from "openai";
interface VisionAnalysis {
description: string;
extractedData: Record<string, unknown>;
confidence: number;
processingTimeMs: number;
}
interface VisionContext {
conversationHistory: Array<{ role: string; content: string }>;
domainHint: string; // "insurance_claim" | "product_support" | "medical"
}
async function analyzeImageInContext(
imageBuffer: Buffer,
mimeType: string,
context: VisionContext
): Promise<VisionAnalysis> {
const client = new OpenAI();
const startTime = Date.now();
// Build context-aware vision prompt
const recentContext = context.conversationHistory
.slice(-5)
.map((m) => `${m.role}: ${m.content}`)
.join("\n");
const domainPrompts: Record<string, string> = {
insurance_claim: `You are analyzing an image submitted as part of an insurance claim.
Extract: damage type, severity estimate (minor/moderate/severe), affected area,
and any visible policy-relevant details (serial numbers, addresses, dates).
Return structured JSON alongside your description.`,
product_support: `You are analyzing a product image for customer support.
Identify: product model, visible damage or defects, error indicators (LEDs, screens),
and any text visible on the product. Return structured JSON.`,
medical: `You are analyzing a medical document or image.
Extract: document type, patient identifiers (redact SSN/DOB),
key findings or diagnoses, and dates. Return structured JSON.`,
};
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: domainPrompts[context.domainHint] || domainPrompts.product_support,
},
{
role: "user",
content: [
{
type: "text",
text: `Recent conversation:\n${recentContext}\n\nThe customer just shared this image. Analyze it in the context of our conversation.`,
},
{
type: "image_url",
image_url: {
url: `data:${mimeType};base64,${imageBuffer.toString("base64")}`,
detail: "high",
},
},
],
},
],
response_format: { type: "json_object" },
max_tokens: 1000,
});
const result = JSON.parse(response.choices[0].message.content || "{}");
return {
description: result.description || "",
extractedData: result.extractedData || {},
confidence: result.confidence || 0.5,
processingTimeMs: Date.now() - startTime,
};
}

Notice how the vision analysis is context-aware. The domain hint selects a specialized system prompt, and the recent conversation history is included so the model understands what the image relates to. An image of a cracked screen means something different in an insurance claim versus a product return.
Document Processing Pipeline
Documents — invoices, contracts, medical records, policy papers — are a special case of vision. They're high-information-density images where extraction accuracy directly impacts business outcomes.
For high-volume document processing, a dedicated pipeline outperforms general-purpose vision LLMs:
import OpenAI from "openai";

interface DocumentExtractionResult {
documentType: string;
pages: number;
extractedFields: Record<string, string | number | boolean>;
tables: Array<{ headers: string[]; rows: string[][] }>;
confidence: number;
flaggedIssues: string[];
}
async function processDocument(
pages: Buffer[],
expectedDocType: string
): Promise<DocumentExtractionResult> {
// Step 1: Classify document type (fast, cheap model)
const classification = await classifyDocument(pages[0]);
if (classification.type !== expectedDocType) {
return {
documentType: classification.type,
pages: pages.length,
extractedFields: {},
tables: [],
confidence: classification.confidence,
flaggedIssues: [
`Expected ${expectedDocType} but detected ${classification.type}`,
],
};
}
// Step 2: Extract structured data (vision LLM with schema).
// extractPageData, mergePageExtractions, and validateExtraction are
// domain-specific helpers elided for brevity.
const extractionPromises = pages.map((page, index) =>
extractPageData(page, expectedDocType, index)
);
const pageResults = await Promise.all(extractionPromises);
// Step 3: Merge and validate across pages
const merged = mergePageExtractions(pageResults);
// Step 4: Cross-reference extracted data with business rules
const issues = validateExtraction(merged, expectedDocType);
return {
documentType: expectedDocType,
pages: pages.length,
extractedFields: merged.fields,
tables: merged.tables,
confidence: merged.averageConfidence,
flaggedIssues: issues,
};
}
async function classifyDocument(
firstPage: Buffer
): Promise<{ type: string; confidence: number }> {
const client = new OpenAI();
const response = await client.chat.completions.create({
model: "gpt-4o-mini", // Fast, cheap — classification doesn't need the big model
messages: [
{
role: "system",
content: `Classify this document. Return JSON: { "type": one of ["invoice", "contract", "medical_record", "insurance_claim", "receipt", "id_document", "other"], "confidence": 0-1 }`,
},
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: `data:image/png;base64,${firstPage.toString("base64")}`,
detail: "low", // Low detail is sufficient for classification
},
},
],
},
],
response_format: { type: "json_object" },
max_tokens: 100,
});
return JSON.parse(response.choices[0].message.content || '{"type":"other","confidence":0}');
}

The two-stage approach — fast classification followed by detailed extraction — saves money and time. Classification uses a smaller model with low-detail image processing. Only confirmed document types get the expensive high-detail extraction pass.
Real-Time Vision: Video Analysis During Calls
Synchronous vision — analyzing video frames during a live conversation — is the hardest modality to add. The challenges stack: frame sampling rate, processing latency, bandwidth, and the need to correlate visual events with speech in real time.
import OpenAI from "openai";

interface FrameAnalysis {
timestamp: number;
objects: Array<{ label: string; confidence: number; bbox: number[] }>;
sceneDescription: string;
actionDetected: string | null;
}
class RealTimeVisionProcessor {
private frameBuffer: Array<{ frame: Buffer; timestamp: number }> = [];
private analysisInterval: ReturnType<typeof setInterval> | null = null;
private lastAnalysis: FrameAnalysis | null = null;
// Sample 1-2 frames per second — more is waste for conversation context
private readonly SAMPLE_RATE_MS = 500;
// Only re-analyze if scene changed significantly
private readonly CHANGE_THRESHOLD = 0.3;
startProcessing(
videoStream: AsyncIterable<{ frame: Buffer; timestamp: number }>
): void {
this.analysisInterval = setInterval(async () => {
const latestFrame = this.frameBuffer[this.frameBuffer.length - 1];
if (!latestFrame) return;
// Skip analysis if scene hasn't changed significantly
if (this.lastAnalysis && !this.hasSceneChanged(latestFrame.frame)) {
return;
}
const analysis = await this.analyzeFrame(latestFrame);
this.lastAnalysis = analysis;
// Emit for fusion with other modalities
this.onAnalysis?.(analysis);
}, this.SAMPLE_RATE_MS);
// Consume video stream, keep only recent frames
(async () => {
for await (const frame of videoStream) {
this.frameBuffer.push(frame);
// Keep last 5 seconds of frames
const cutoff = Date.now() - 5000;
this.frameBuffer = this.frameBuffer.filter((f) => f.timestamp > cutoff);
}
})();
}
private hasSceneChanged(currentFrame: Buffer): boolean {
// In production: use perceptual hashing or frame differencing
// This is a placeholder — real implementations compare image hashes
return true;
}
private async analyzeFrame(
frame: { frame: Buffer; timestamp: number }
): Promise<FrameAnalysis> {
// Use a fast model — latency matters more than depth here
const client = new OpenAI();
const response = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
"Briefly describe what you see. Note any objects, text, gestures, or actions. JSON: { objects: [{label, confidence}], sceneDescription, actionDetected }",
},
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: `data:image/jpeg;base64,${frame.frame.toString("base64")}`,
detail: "low",
},
},
],
},
],
response_format: { type: "json_object" },
max_tokens: 200,
});
const result = JSON.parse(response.choices[0].message.content || "{}");
return { timestamp: frame.timestamp, ...result };
}
onAnalysis?: (analysis: FrameAnalysis) => void;
stop(): void {
if (this.analysisInterval) clearInterval(this.analysisInterval);
}
}

The key optimization is change detection. Don't analyze every frame — most video in a customer service call shows a static scene. Only trigger full vision analysis when the scene changes meaningfully (customer holds up a new document, points at something different, moves to a different area).
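The `hasSceneChanged` placeholder above can be filled in with something as simple as sampled byte differencing over raw frames — a crude stand-in for the perceptual hashing a production system would use, valid only when both frames are decoded to the same raw format and resolution:

```typescript
// Fraction of sampled bytes that differ beyond a tolerance between two
// raw frames. A crude stand-in for perceptual hashing (e.g. pHash);
// both frames must share format and resolution.
function frameChangeRatio(
  prev: Buffer,
  curr: Buffer,
  sampleStride = 64,
  tolerance = 16
): number {
  const len = Math.min(prev.length, curr.length);
  let sampled = 0;
  let changed = 0;
  for (let i = 0; i < len; i += sampleStride) {
    sampled++;
    if (Math.abs(prev[i] - curr[i]) > tolerance) changed++;
  }
  return sampled === 0 ? 1 : changed / sampled;
}

// Wire it into the processor's CHANGE_THRESHOLD of 0.3:
// if (frameChangeRatio(lastFrame, newFrame) > 0.3) { /* re-analyze */ }
```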
Cross-Modal Fusion: Where the Real Complexity Lives
Fusion is the process of combining information from voice, vision, and text into a unified context that the reasoning engine can act on — and getting it wrong means your agent responds to what the customer said five seconds ago while looking at an image they sent thirty seconds ago.
Three fusion strategies exist, each with different tradeoffs:
Early Fusion
Combine raw representations from all modalities before any processing. This is what native multimodal models like GPT-4o do internally — they process audio tokens, image tokens, and text tokens in a single attention mechanism.
Advantage: The model captures cross-modal interactions that separate processing would miss (tone of voice + facial expression = more accurate sentiment than either alone).
Disadvantage: Computationally expensive, locked to one provider, and impossible to debug intermediate representations.
Late Fusion
Process each modality independently, then combine the results at the decision level.
Advantage: Modular, debuggable, and each modality can use a specialized model.
Disadvantage: Misses cross-modal correlations. "This one" (speech) + pointing gesture (vision) can't be resolved if they're processed separately.
Hybrid Fusion
Process modalities partially independently, then fuse at multiple points — early enough to capture cross-modal references, late enough to maintain modularity.
This is what most production systems use: specialized models per modality, with events aligned on a shared timeline and fused before reasoning.

Here's the TypeScript implementation of the fusion layer:
interface ModalityEvent {
modality: "voice" | "vision" | "text";
timestamp: number;
data: VoiceEvent | VisionEvent | TextEvent;
}
interface VoiceEvent {
transcript: string;
sentiment: "positive" | "neutral" | "negative" | "frustrated";
confidence: number;
}
interface VisionEvent {
description: string;
objects: Array<{ label: string; confidence: number }>;
extractedText: string | null;
documentType: string | null;
}
interface TextEvent {
message: string;
attachments: string[];
}
interface FusedContext {
timestamp: number;
transcript: string | null;
visionContext: string | null;
textMessage: string | null;
crossModalReferences: CrossModalReference[];
sentiment: string;
unifiedSummary: string;
}
interface CrossModalReference {
type: "deictic" | "anaphoric" | "temporal";
sourceModality: string;
targetModality: string;
description: string;
}
class ModalityFusion {
private eventBuffer: ModalityEvent[] = [];
private readonly ALIGNMENT_WINDOW_MS = 3000; // 3-second window for cross-modal alignment
addEvent(event: ModalityEvent): void {
this.eventBuffer.push(event);
// Prune events older than 30 seconds
const cutoff = Date.now() - 30000;
this.eventBuffer = this.eventBuffer.filter((e) => e.timestamp > cutoff);
}
fuse(): FusedContext {
const now = Date.now();
const recentEvents = this.eventBuffer.filter(
(e) => now - e.timestamp < this.ALIGNMENT_WINDOW_MS
);
// Group by modality
const voiceEvents = recentEvents.filter(
(e) => e.modality === "voice"
) as Array<ModalityEvent & { data: VoiceEvent }>;
const visionEvents = recentEvents.filter(
(e) => e.modality === "vision"
) as Array<ModalityEvent & { data: VisionEvent }>;
const textEvents = recentEvents.filter(
(e) => e.modality === "text"
) as Array<ModalityEvent & { data: TextEvent }>;
// Resolve cross-modal references
const references = this.resolveCrossModalReferences(
voiceEvents,
visionEvents,
textEvents
);
// Build unified context
const latestVoice = voiceEvents[voiceEvents.length - 1];
const latestVision = visionEvents[visionEvents.length - 1];
const latestText = textEvents[textEvents.length - 1];
return {
timestamp: now,
transcript: latestVoice?.data.transcript || null,
visionContext: latestVision?.data.description || null,
textMessage: latestText?.data.message || null,
crossModalReferences: references,
sentiment: latestVoice?.data.sentiment || "neutral",
unifiedSummary: this.buildUnifiedSummary(
latestVoice?.data,
latestVision?.data,
latestText?.data,
references
),
};
}
private resolveCrossModalReferences(
voiceEvents: Array<ModalityEvent & { data: VoiceEvent }>,
visionEvents: Array<ModalityEvent & { data: VisionEvent }>,
_textEvents: Array<ModalityEvent & { data: TextEvent }>
): CrossModalReference[] {
const references: CrossModalReference[] = [];
// Detect deictic references: "this", "that", "here" in speech near image events
for (const voice of voiceEvents) {
const deicticPatterns = /\b(this|that|these|those|here|there|it)\b/gi;
if (deicticPatterns.test(voice.data.transcript)) {
// Find vision events within alignment window
const nearbyVision = visionEvents.filter(
(v) => Math.abs(v.timestamp - voice.timestamp) < this.ALIGNMENT_WINDOW_MS
);
if (nearbyVision.length > 0) {
references.push({
type: "deictic",
sourceModality: "voice",
targetModality: "vision",
description: `Speech "${voice.data.transcript}" likely references visual context: "${nearbyVision[0].data.description}"`,
});
}
}
}
return references;
}
private buildUnifiedSummary(
voice: VoiceEvent | undefined,
vision: VisionEvent | undefined,
text: TextEvent | undefined,
references: CrossModalReference[]
): string {
const parts: string[] = [];
if (voice) {
parts.push(`Customer said: "${voice.transcript}" (sentiment: ${voice.sentiment})`);
}
if (vision) {
parts.push(`Visual context: ${vision.description}`);
if (vision.extractedText) {
parts.push(`Text in image: "${vision.extractedText}"`);
}
}
if (text) {
parts.push(`Text message: "${text.message}"`);
}
if (references.length > 0) {
parts.push(
`Cross-modal links: ${references.map((r) => r.description).join("; ")}`
);
}
return parts.join("\n");
}
}

Temporal Synchronization: The Hardest Problem
Voice arrives as a continuous stream at 16kHz. Images arrive sporadically — when a customer uploads one or when the system samples a video frame. Text messages arrive asynchronously. Aligning these streams is genuinely difficult.
The alignment window approach shown above (3-second window for cross-modal references) works for most customer service scenarios. But it breaks when there's significant latency between modalities. If STT takes 500ms and vision processing takes 2 seconds, a customer who says "look at this" and simultaneously sends a photo will have events that are 1.5 seconds apart in your system's timeline, even though they were simultaneous from the customer's perspective.
Production systems need to track event origin time (when the customer acted) separately from processing completion time (when the system finished analyzing it). Timestamp everything at capture, not at completion.
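A minimal sketch of that separation — field names are ours, not from any SDK:

```typescript
// Separate when the customer acted from when the system finished analyzing.
interface TimedEvent {
  modality: "voice" | "vision" | "text";
  capturedAt: number; // stamped at the microphone / upload / frame grab
  processedAt: number; // stamped when analysis completed
}

// Alignment compares capture times only — never processing times.
function wereSimultaneous(
  a: TimedEvent,
  b: TimedEvent,
  windowMs = 3000
): boolean {
  return Math.abs(a.capturedAt - b.capturedAt) <= windowMs;
}

// "Look at this" spoken at t=0ms; photo captured at t=100ms.
// STT finishes 500ms later, vision 2s later — processing completions are
// 1.6s apart, but the capture timestamps still align.
const speech: TimedEvent = { modality: "voice", capturedAt: 0, processedAt: 500 };
const photo: TimedEvent = { modality: "vision", capturedAt: 100, processedAt: 2100 };
// wereSimultaneous(speech, photo) → true
```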
Building a Multimodal Agent Architecture
The architecture that works in production separates perception, fusion, reasoning, and action into independent layers — each with its own scaling characteristics, failure modes, and monitoring hooks — connected by an orchestrator that manages conversation state across all modalities.
Here's the orchestrator that ties the layers together:
import { z } from "zod";

// Reuses VoicePipeline, RealTimeVisionProcessor, ModalityFusion, and
// analyzeImageInContext from the earlier sections. AgentMetrics is a thin
// latency-recording helper whose implementation is elided.
// Configuration schema for type safety
const MultimodalAgentConfigSchema = z.object({
agentId: z.string(),
workspaceId: z.string(),
modalities: z.object({
voice: z.object({
enabled: z.boolean(),
sttModel: z.string().default("deepgram-nova-3"),
ttsModel: z.string().default("elevenlabs-flash-v2.5"),
ttsVoice: z.string().default("rachel"),
interruptionEnabled: z.boolean().default(true),
}),
vision: z.object({
enabled: z.boolean(),
model: z.string().default("gpt-4o"),
maxImagesPerConversation: z.number().default(10),
realtimeVideo: z.boolean().default(false),
frameSampleRateMs: z.number().default(500),
}),
text: z.object({
enabled: z.boolean(),
}),
}),
reasoning: z.object({
model: z.string().default("gpt-4o"),
systemPrompt: z.string(),
temperature: z.number().default(0.3),
maxTokens: z.number().default(2048),
tools: z.array(z.string()).default([]),
}),
latencyBudget: z.object({
voiceResponseMs: z.number().default(800),
visionProcessingMs: z.number().default(5000),
toolExecutionMs: z.number().default(10000),
}),
});
type MultimodalAgentConfig = z.infer<typeof MultimodalAgentConfigSchema>;
class MultimodalAgent {
private config: MultimodalAgentConfig;
private voicePipeline: VoicePipeline | null = null;
private visionProcessor: RealTimeVisionProcessor | null = null;
private fusion: ModalityFusion;
private conversationId: string;
private metrics: AgentMetrics;
constructor(config: MultimodalAgentConfig) {
this.config = MultimodalAgentConfigSchema.parse(config);
this.fusion = new ModalityFusion();
this.conversationId = crypto.randomUUID();
this.metrics = new AgentMetrics(config.agentId);
}
async initialize(): Promise<void> {
// Initialize only enabled modalities
if (this.config.modalities.voice.enabled) {
this.voicePipeline = new VoicePipeline({
sttProvider: this.createSTTProvider(),
llmProvider: this.createLLMProvider(),
ttsProvider: this.createTTSProvider(),
vadSensitivity: 0.5,
interruptionEnabled: this.config.modalities.voice.interruptionEnabled,
maxResponseLatencyMs: this.config.latencyBudget.voiceResponseMs,
});
this.voicePipeline.on("transcript", (event) => {
this.fusion.addEvent({
modality: "voice",
timestamp: Date.now(),
data: {
transcript: event.text,
sentiment: "neutral", // Would come from sentiment analysis
confidence: event.confidence,
},
});
this.metrics.recordLatency("stt", event.latencyMs);
});
this.voicePipeline.on("firstAudio", (event) => {
this.metrics.recordLatency("voice_total", event.totalLatencyMs);
});
}
if (
this.config.modalities.vision.enabled &&
this.config.modalities.vision.realtimeVideo
) {
this.visionProcessor = new RealTimeVisionProcessor();
this.visionProcessor.onAnalysis = (analysis) => {
this.fusion.addEvent({
modality: "vision",
timestamp: analysis.timestamp,
data: {
description: analysis.sceneDescription,
objects: analysis.objects,
extractedText: null,
documentType: null,
},
});
};
}
}
async handleImageUpload(
imageBuffer: Buffer,
mimeType: string
): Promise<string> {
if (!this.config.modalities.vision.enabled) {
return "Image processing is not available for this agent.";
}
const startTime = Date.now();
const analysis = await analyzeImageInContext(imageBuffer, mimeType, {
conversationHistory: this.getRecentHistory(),
domainHint: this.inferDomain(),
});
this.fusion.addEvent({
modality: "vision",
timestamp: Date.now(),
data: {
description: analysis.description,
objects: [],
extractedText: JSON.stringify(analysis.extractedData),
documentType: null,
},
});
this.metrics.recordLatency("vision", Date.now() - startTime);
// Build response using fused context
const fusedContext = this.fusion.fuse();
return this.generateResponse(fusedContext);
}
async handleTextMessage(message: string): Promise<string> {
this.fusion.addEvent({
modality: "text",
timestamp: Date.now(),
data: {
message,
attachments: [],
},
});
const fusedContext = this.fusion.fuse();
return this.generateResponse(fusedContext);
}
private async generateResponse(context: FusedContext): Promise<string> {
const systemPrompt = `${this.config.reasoning.systemPrompt}
Current multimodal context:
${context.unifiedSummary}
${
context.crossModalReferences.length > 0
? `Note: The customer appears to be referencing visual content in their speech.
${context.crossModalReferences.map((r) => r.description).join("\n")}`
: ""
}`;
// Call LLM with fused context
const client = new OpenAI();
const response = await client.chat.completions.create({
model: this.config.reasoning.model,
messages: [
{ role: "system", content: systemPrompt },
...this.getRecentHistory(),
{
role: "user",
content: context.transcript || context.textMessage || "[Image uploaded]",
},
],
temperature: this.config.reasoning.temperature,
max_tokens: this.config.reasoning.maxTokens,
});
return response.choices[0].message.content || "";
}
private getRecentHistory(): Array<{ role: "user" | "assistant"; content: string }> {
// Stub — in production, return the last ~20 turns: enough context without exploding tokens
return [];
}
private inferDomain(): string {
// Infer from agent configuration or conversation content
return "product_support";
}
private createSTTProvider(): STTProvider {
// Factory based on config.modalities.voice.sttModel
return {} as STTProvider;
}
private createLLMProvider(): LLMProvider {
return {} as LLMProvider;
}
private createTTSProvider(): TTSProvider {
return {} as TTSProvider;
}
}
class AgentMetrics {
private latencies: Map<string, number[]> = new Map();
constructor(private agentId: string) {}
recordLatency(stage: string, ms: number): void {
const existing = this.latencies.get(stage) || [];
existing.push(ms);
this.latencies.set(stage, existing.slice(-100)); // Keep last 100
}
getP95(stage: string): number {
const values = this.latencies.get(stage) || [];
if (values.length === 0) return 0;
const sorted = [...values].sort((a, b) => a - b);
return sorted[Math.floor(sorted.length * 0.95)];
}
getSummary(): Record<string, { p50: number; p95: number; count: number }> {
const summary: Record<string, { p50: number; p95: number; count: number }> = {};
for (const [stage, values] of this.latencies) {
const sorted = [...values].sort((a, b) => a - b);
summary[stage] = {
p50: sorted[Math.floor(sorted.length * 0.5)],
p95: sorted[Math.floor(sorted.length * 0.95)],
count: values.length,
};
}
return summary;
}
}
Graceful Degradation
Production multimodal agents must handle partial failures. When one modality is degraded, the agent should continue functioning with the remaining modalities — not crash.
interface ModalityHealth {
voice: { status: "healthy" | "degraded" | "down"; lastCheck: number };
vision: { status: "healthy" | "degraded" | "down"; lastCheck: number };
text: { status: "healthy" | "degraded" | "down"; lastCheck: number };
}
class GracefulDegradation {
private health: ModalityHealth;
private circuitBreakers: Map<string, CircuitBreaker> = new Map();
constructor() {
this.health = {
voice: { status: "healthy", lastCheck: Date.now() },
vision: { status: "healthy", lastCheck: Date.now() },
text: { status: "healthy", lastCheck: Date.now() },
};
}
async withFallback<T>(
modality: keyof ModalityHealth,
primary: () => Promise<T>,
fallback: () => Promise<T>,
timeout: number
): Promise<T> {
const breaker = this.getCircuitBreaker(modality);
if (breaker.isOpen()) {
this.health[modality].status = "down";
return fallback();
}
try {
const result = await Promise.race([
primary(),
this.createTimeout<T>(timeout, modality),
]);
breaker.recordSuccess();
this.health[modality].status = "healthy";
return result;
} catch (error) {
breaker.recordFailure();
this.health[modality].status =
breaker.isOpen() ? "down" : "degraded";
return fallback();
}
}
private createTimeout<T>(ms: number, modality: string): Promise<T> {
return new Promise((_, reject) =>
setTimeout(() => reject(new Error(`${modality} timeout after ${ms}ms`)), ms)
);
}
private getCircuitBreaker(modality: string): CircuitBreaker {
if (!this.circuitBreakers.has(modality)) {
this.circuitBreakers.set(
modality,
new CircuitBreaker({ failureThreshold: 3, resetTimeoutMs: 30000 })
);
}
return this.circuitBreakers.get(modality)!;
}
}
class CircuitBreaker {
private failures = 0;
private lastFailure = 0;
private config: { failureThreshold: number; resetTimeoutMs: number };
constructor(config: { failureThreshold: number; resetTimeoutMs: number }) {
this.config = config;
}
isOpen(): boolean {
if (this.failures < this.config.failureThreshold) return false;
// Allow retry after reset timeout
if (Date.now() - this.lastFailure > this.config.resetTimeoutMs) {
this.failures = 0;
return false;
}
return true;
}
recordSuccess(): void {
this.failures = 0;
}
recordFailure(): void {
this.failures++;
this.lastFailure = Date.now();
}
}
Each modality gets its own circuit breaker. Three consecutive failures open the circuit, routing requests to the fallback for 30 seconds before retrying. The fallback for voice might be text-based chat; for vision, asking the customer to describe what they see; for text, a voice prompt.
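That fallback routing can be sketched as a simple per-modality map. The names here (`Modality`, `respondWithFallback`, the response strings) are illustrative, not part of the classes above:

```typescript
// Illustrative per-modality fallback responses. When a circuit opens,
// withFallback routes to one of these instead of the failing pipeline.
type Modality = "voice" | "vision" | "text";

const fallbackResponses: Record<Modality, () => Promise<string>> = {
  // Voice down: continue the conversation over text chat
  voice: async () =>
    "Voice is unavailable right now, so I've switched us to text chat.",
  // Vision down: ask the customer to describe what they see
  vision: async () =>
    "I couldn't process that image. Could you describe what you see?",
  // Text down: prompt the customer to speak instead
  text: async () =>
    "I didn't receive your message. Could you tell me instead?",
};

async function respondWithFallback(modality: Modality): Promise<string> {
  return fallbackResponses[modality]();
}
```

Keeping the fallback responses in data rather than code makes them easy to localize and A/B test without touching the degradation logic.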
Latency Budgets for Real-Time Multimodal
Every millisecond matters in voice interactions, and adding vision or document processing to a voice call creates latency pressure that can break the conversational experience. You need explicit budgets for every processing stage, with hard cutoffs and async offloading for anything that would blow the budget.
Here's a realistic latency budget for a multimodal voice agent:
| Stage | Budget | Strategy |
|---|---|---|
| VAD | 10ms | Local, negligible |
| STT (streaming) | 150ms | Streaming partial results |
| Vision (async image) | 2-5s | Process in background, inject context when ready |
| Vision (frame analysis) | 500ms | Low-detail, fast model, skip unchanged frames |
| LLM (time to first token) | 300ms | Use fast providers (Groq, Fireworks) or smaller models |
| Tool execution | 1-10s | Async with progress updates |
| TTS (time to first audio) | 150ms | Streaming synthesis, sentence-level buffering |
| Voice round-trip | <800ms | STT + LLM TTFT + TTS TTFA |
The key insight is that not everything needs to be synchronous. Image analysis that takes 3 seconds doesn't block the voice pipeline — the agent acknowledges the image immediately ("I can see your photo — let me take a closer look") and injects the vision context into the next turn.
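The acknowledge-then-inject flow fits in a few lines. Here `speak`, `analyze`, and `injectContext` are hypothetical stand-ins for the agent's real methods, shown only to make the control flow concrete:

```typescript
// Sketch of the acknowledge-then-inject pattern: respond immediately,
// run slow vision work in the background, surface results on a later turn.
async function onImageReceived(
  speak: (text: string) => void,
  analyze: () => Promise<string>,
  injectContext: (summary: string) => void
): Promise<void> {
  // 1. Acknowledge within the conversational rhythm; never block on vision.
  speak("I can see your photo, let me take a closer look.");
  // 2. Run the multi-second analysis without holding up the voice pipeline.
  const summary = await analyze();
  // 3. Make the result available to the next LLM turn.
  injectContext(`[Photo analysis: ${summary}]`);
}
```

The key property is that `speak` fires before `analyze` resolves, so the voice pipeline keeps its sub-second rhythm regardless of how long the image takes.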
interface LatencyBudget {
stage: string;
budgetMs: number;
actual: number;
overBudget: boolean;
}
class LatencyMonitor {
private budgets: Map<string, number> = new Map([
["stt", 200],
["llm_ttft", 400],
["tts_ttfa", 200],
["vision_async", 5000],
["vision_realtime", 500],
["tool_execution", 10000],
["voice_total", 800],
]);
private measurements: Map<string, number[]> = new Map();
record(stage: string, durationMs: number): LatencyBudget {
const budget = this.budgets.get(stage) || Infinity;
const existing = this.measurements.get(stage) || [];
existing.push(durationMs);
this.measurements.set(stage, existing.slice(-1000));
const result = {
stage,
budgetMs: budget,
actual: durationMs,
overBudget: durationMs > budget,
};
if (result.overBudget) {
console.warn(
`[latency] ${stage} over budget: ${durationMs}ms (budget: ${budget}ms)`
);
}
return result;
}
getReport(): Array<{
stage: string;
p50: number;
p95: number;
budget: number;
complianceRate: number;
}> {
const report = [];
for (const [stage, values] of this.measurements) {
const sorted = [...values].sort((a, b) => a - b);
const budget = this.budgets.get(stage) || Infinity;
const withinBudget = values.filter((v) => v <= budget).length;
report.push({
stage,
p50: sorted[Math.floor(sorted.length * 0.5)],
p95: sorted[Math.floor(sorted.length * 0.95)],
budget,
complianceRate: withinBudget / values.length,
});
}
return report;
}
}
Production monitoring dashboards should surface these latency budgets alongside conversation quality metrics. A 95th-percentile STT latency creeping above 300ms might not crash anything, but it degrades the experience long before users explicitly complain.
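One way to act on the monitor's report is a simple compliance check. `stagesToAlert` and the 95% threshold are illustrative choices, not part of the `LatencyMonitor` API above:

```typescript
// Illustrative alerting pass over a LatencyMonitor-style report: flag any
// stage whose P95 exceeds its budget or whose compliance rate drops below
// a chosen threshold.
interface StageReport {
  stage: string;
  p95: number;
  budget: number;
  complianceRate: number;
}

function stagesToAlert(report: StageReport[], minCompliance = 0.95): string[] {
  return report
    .filter((r) => r.p95 > r.budget || r.complianceRate < minCompliance)
    .map((r) => r.stage);
}

const sample: StageReport[] = [
  { stage: "stt", p95: 180, budget: 200, complianceRate: 0.98 },
  { stage: "llm_ttft", p95: 450, budget: 400, complianceRate: 0.91 },
];
// stagesToAlert(sample) flags only "llm_ttft": over budget at P95 and below 95% compliance.
```

Wiring this into a pager or dashboard annotation turns budget drift into an actionable signal instead of a silent degradation.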
Real-World Multimodal Pattern: Insurance Claims
Insurance claims processing is one of the strongest production use cases for multimodal agents — customers describe damage over the phone, text photos as evidence, and the agent fuses both modalities to triage severity, flag policy-relevant details, and route to the right adjuster team in a single conversation.
Real deployments using this pattern have cut First Notice of Loss (FNOL) completion times from 18 minutes to under 6 and shortened overall claim cycles by 22%. Here's the architecture that makes it work.
interface ClaimContext {
claimId: string;
policyNumber: string;
damagePhotos: Array<{
url: string;
analysis: VisionAnalysis;
timestamp: number;
}>;
voiceTranscript: string[];
extractedDamageDetails: {
type: string;
severity: "minor" | "moderate" | "severe";
affectedAreas: string[];
estimatedCost: number | null;
};
sentiment: string;
}
class InsuranceClaimsAgent {
private agent: MultimodalAgent;
private claimContext: ClaimContext;
constructor(policyNumber: string) {
this.agent = new MultimodalAgent({
agentId: "insurance-claims-v2",
workspaceId: "ws-insurance-co",
modalities: {
voice: {
enabled: true,
sttModel: "deepgram-nova-3",
ttsModel: "elevenlabs-flash-v2.5",
ttsVoice: "professional-empathetic",
interruptionEnabled: true,
},
vision: {
enabled: true,
model: "gpt-4o",
maxImagesPerConversation: 20,
realtimeVideo: false,
frameSampleRateMs: 500,
},
text: { enabled: true },
},
reasoning: {
model: "gpt-4o",
systemPrompt: this.buildClaimsPrompt(),
temperature: 0.2, // Low temperature for consistent triage
maxTokens: 2048,
tools: ["lookup_policy", "create_claim", "route_to_adjuster", "estimate_damage"],
},
latencyBudget: {
voiceResponseMs: 800,
visionProcessingMs: 5000,
toolExecutionMs: 15000,
},
});
this.claimContext = {
claimId: crypto.randomUUID(),
policyNumber,
damagePhotos: [],
voiceTranscript: [],
extractedDamageDetails: {
type: "unknown",
severity: "moderate",
affectedAreas: [],
estimatedCost: null,
},
sentiment: "neutral",
};
}
private buildClaimsPrompt(): string {
return `You are an insurance claims assistant for a property insurance company.
Your role:
1. Gather information about the damage (type, cause, extent, timing)
2. Analyze any photos the customer shares for damage assessment
3. Look up the customer's policy to verify coverage
4. Create a preliminary claim and route to the appropriate adjuster team
Tone: Empathetic but efficient. The customer is likely stressed.
When analyzing damage photos:
- Note specific damage indicators (water stains, cracks, burn marks)
- Estimate severity based on visible extent
- Flag any safety concerns (structural damage, exposed wiring, mold)
- Cross-reference verbal description with visual evidence
If verbal description and photo evidence conflict, ask clarifying questions.
Never make coverage promises — say "based on your policy" and route to adjuster.`;
}
async onPhotoReceived(imageBuffer: Buffer, mimeType: string): Promise<void> {
// Acknowledge immediately — don't block voice for vision processing
void this.agent.handleTextMessage(
"[System: Customer shared a photo. Analyzing now.]"
);
// Process image with claims-specific context
const analysis = await analyzeImageInContext(imageBuffer, mimeType, {
conversationHistory: this.claimContext.voiceTranscript.map((t) => ({
role: "user",
content: t,
})),
domainHint: "insurance_claim",
});
this.claimContext.damagePhotos.push({
url: `claim-${this.claimContext.claimId}-photo-${this.claimContext.damagePhotos.length}`,
analysis,
timestamp: Date.now(),
});
// Update damage assessment with visual evidence
if (analysis.extractedData.severity) {
this.claimContext.extractedDamageDetails.severity =
analysis.extractedData.severity as "minor" | "moderate" | "severe";
}
// Inject vision context into conversation
await this.agent.handleTextMessage(
`[System: Photo analysis complete. ${analysis.description}. Severity: ${analysis.extractedData.severity || "undetermined"}. Confidence: ${(analysis.confidence * 100).toFixed(0)}%]`
);
}
}
This pattern — immediate acknowledgment, async processing, context injection — is how production multimodal agents handle vision without breaking the voice experience. The customer never waits in silence while their photo processes. They hear "I can see your photo" within the normal conversational rhythm, and the analysis results flow into the agent's context for the next response.
Monitoring Multimodal Agents
Monitoring a multimodal agent requires tracking per-modality latency, cross-modal alignment accuracy, and fallback rates alongside the usual conversation quality metrics — because a failure in vision processing can silently degrade the entire experience even when voice works perfectly.
The metrics that matter:
| Metric | What It Reveals | Target |
|---|---|---|
| Voice round-trip latency (P95) | Conversational quality | <800ms |
| STT word error rate | Transcript accuracy | <10% |
| Vision processing time | Image turnaround | <5s async, <500ms real-time |
| Cross-modal alignment accuracy | Fusion quality | Manual eval |
| Modality fallback rate | System reliability | <5% |
| Interruption recovery time | Conversation naturalness | <200ms |
Analytics dashboards should track these per-agent and per-conversation, surfacing degradation trends before they become customer complaints. The cross-modal alignment metric is hardest to automate — it typically requires human evaluation of conversations where the customer referenced images while speaking.

Conversation memory is especially important for multimodal agents. When a customer calls back about a claim they started yesterday, the agent should remember not just what was said but what photos were shared and what the vision analysis revealed.
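A hypothetical shape for that cross-session memory might look like the following. The record and the `recallPreamble` helper are illustrative, not an API from this article:

```typescript
// Hypothetical cross-session multimodal memory: the returning customer's
// agent can recall prior photo analyses alongside the transcript.
interface MultimodalMemoryRecord {
  conversationId: string;
  transcript: string[];
  photoAnalyses: Array<{ ref: string; summary: string; severity?: string }>;
  lastUpdated: number;
}

// Condense stored memory into a system-prompt preamble for the new session.
function recallPreamble(memory: MultimodalMemoryRecord): string {
  const photos = memory.photoAnalyses
    .map((p) => `- ${p.summary}${p.severity ? ` (severity: ${p.severity})` : ""}`)
    .join("\n");
  return `Prior session context:\n${photos || "- no photos shared"}`;
}
```

Storing the vision *analysis* rather than the raw image keeps the memory record small and avoids re-running expensive image processing on the callback.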
What's Coming: The Multimodal Frontier
The next 12-18 months will bring native multimodal models that rival cascaded pipeline quality, edge-deployed vision models that process on-device, and embodied agents that extend fusion beyond software into physical environments.
Native multimodal models are catching up. Gemini 2.5's native audio understanding and OpenAI's gpt-realtime model are closing the gap between cascaded pipelines and end-to-end models. When these models reliably match cascaded pipeline latency with better quality, the architecture simplifies dramatically. But we're not there yet — debuggability and component independence still favor cascaded approaches for production systems.
Edge multimodal is becoming viable. Models like MiniCPM-V (8B parameters) run on mobile phones while outperforming GPT-4V on multiple benchmarks. This means vision processing can happen on the customer's device before data ever hits your servers — reducing latency and addressing privacy concerns for sensitive documents.
Embodied multimodal agents are emerging. Beyond voice and vision in software, multimodal agents are moving into robotics and physical spaces. The same fusion architecture that aligns speech and images will need to align proprioception, spatial awareness, and physical actions. The temporal synchronization patterns we've covered scale to these domains.
The infrastructure you build today for voice-vision-text fusion is the foundation for whatever modalities come next. The patterns — perception layers, temporal alignment, hybrid fusion, graceful degradation, latency budgets — are modality-agnostic. Get them right, and adding a new input channel is an engineering task, not an architectural redesign.
For teams building multimodal agents with MCP-based tool infrastructure, the tool execution layer we discussed in AI Agent Tools connects directly to the action layer of the multimodal architecture. If you haven't built an MCP server yet, MCP Explained walks through the protocol fundamentals. The agent reasons across modalities, decides on an action, and executes it through the same tool management layer that single-modality agents use.
Start with text. Add voice when you need it. Add vision when the use case demands it. And build the fusion layer to accommodate modalities that don't exist yet.
Sources
- Multimodal AI Market Size (2026-2031) — Mordor Intelligence
- Agentic AI Market Size and Growth Forecast — MarketsandMarkets
- Core Latency in AI Voice Agents — Twilio Engineering
- Voice AI Infrastructure: Building Real-Time Speech Agents — Introl
- Cracking the Sub-1-Second Voice Loop — CloudX (30+ Stack Benchmarks)
- Engineering for Real-Time Voice Agent Latency — Cresta
- Building the Lowest Latency Voice Agent — AssemblyAI
- Speech Recognition Accuracy: Production Metrics — Deepgram
- Whisper vs Deepgram 2025: STT Comparison — Deepgram
- Introducing gpt-realtime and Realtime API Updates — OpenAI
- Gemini 2.5 Native Audio Capabilities — Google
- GPT-4o Vision Evaluation on Standard CV Tasks — arXiv
- 8 Best Multimodal AI Platforms Compared — Index.dev
- Architecture, Trends & Deployment of Multimodal AI Agents — Kanerika
- The Rise of Multimodal AI Agents — XenonStack
- Latency Optimization — ElevenLabs Documentation
- Pipecat: Open Source Voice and Multimodal AI Framework — GitHub
- AI-Powered Insurance Claims Processing — Multimodal.dev