Chanl
Agent Architecture

SSE vs WebSockets for Voice AI: Choosing the Right Real-Time Transport

SSE and WebSockets solve different problems in voice AI pipelines. Learn when each transport wins, how to avoid the buffering traps that silently break streaming, and why pipeline parallelism matters more than raw model speed.

Lucas Dalamarta, Engineering Lead
March 24, 2026
14 min read

Voice AI latency is not just a performance metric. It's a product threshold. When your agent takes longer than 300ms to begin responding, the caller's brain switches from conversation mode to waiting mode. That perceptual shift degrades the interaction even when accuracy is identical. The architecture that determines whether you land on the right side of that threshold is built on two streaming primitives: Server-Sent Events (SSE) and WebSockets. Choosing the wrong one doesn't just add latency. It creates architectural constraints you'll fight for months.

This article explains what each transport actually does, where each one breaks down in real voice scenarios, and how pipeline parallelism makes a bigger difference than raw model speed.

Why streaming transports matter for voice AI

The core insight is that batch processing and real-time conversation are fundamentally incompatible. Traditional request-response architecture (wait for the full question, process it entirely, generate a complete answer, return it all at once) produces unacceptable latency for voice, because each stage must finish before the next can begin. Streaming transports exist to pipeline these stages so they overlap instead of queue.

Here's how the latency math works without streaming:

| Stage | Naive duration |
| --- | --- |
| Speech-to-text (batch) | 200-400ms |
| LLM generation (wait for full response) | 1-4 seconds |
| Text-to-speech (full response synthesis) | 500ms-1.5s |
| Network round trips | 40-100ms |
| Total | ~2-6 seconds |

And with pipeline parallelism through streaming:

| Stage | Streaming duration |
| --- | --- |
| STT (streaming, start on partial audio) | 80-120ms to first transcript |
| LLM (streaming, first token latency) | 100-150ms |
| TTS (streaming, first audio chunk) | 60-100ms |
| Network (concurrent, not sequential) | 20-50ms overhead |
| Total perceived | 260-420ms |

The stages still take the same total time. Streaming doesn't make the models faster. What it does is overlap them. The user starts hearing audio before the LLM has finished generating the full response, because TTS is synthesizing the first few sentences while the LLM is still working on the rest. That's pipeline parallelism, and it's where the 40-60% latency reduction comes from.
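To make the overlap concrete, here's a toy model of the math. The stage numbers are illustrative midpoints of the ranges in the tables above, not measurements:

```typescript
// Illustrative stage timings in ms — midpoints of the ranges above
interface Stage {
  full: number;       // time to finish the whole stage
  firstChunk: number; // time until the stage emits its first output
}

const stages: Stage[] = [
  { full: 300, firstChunk: 100 },  // STT
  { full: 2500, firstChunk: 125 }, // LLM
  { full: 1000, firstChunk: 80 },  // TTS
  { full: 70, firstChunk: 35 },    // network
];

// Batch: every stage must finish before the next begins
const sequential = stages.reduce((sum, s) => sum + s.full, 0);

// Pipelined: the user hears audio once each stage has produced its
// first chunk — everything after that overlaps with upstream work
const timeToFirstAudio = stages.reduce((sum, s) => sum + s.firstChunk, 0);

console.log(sequential);       // 3870
console.log(timeToFirstAudio); // 340
```

The models do the same work in both cases; only the perceived latency changes.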

[Figure] Pipeline parallelism: User Audio → Speech-to-Text → Language Model → Text-to-Speech → Speaker. Partial transcript at t=80ms, complete transcript and first token at t=200ms, first audio chunk at t=280ms — stages overlap instead of waiting for each other.

SSE: the right tool for server-to-client streaming

Server-Sent Events are a browser-native protocol for one-directional streaming over standard HTTP. The server opens a persistent connection and pushes events as they're generated. The client listens. No bidirectional channel, no protocol upgrade. Just HTTP with Content-Type: text/event-stream and a persistent keep-alive.

For AI text streaming, where the model generates tokens and you want them to appear progressively in the UI, SSE is the default choice. It's what OpenAI, Anthropic, and nearly every other AI API use for their streaming endpoints. It works over HTTP/2 (which multiplexes connections), passes through standard reverse proxies, and the browser's EventSource API reconnects automatically on drop.

Here's a minimal SSE server that streams from an LLM:

typescript
import express from "express";
 
const app = express();
app.use(express.json());
 
app.post("/api/chat/stream", async (req, res) => {
  const { agentId, messages } = req.body;
 
  // These three headers establish the SSE connection
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // Critical behind Nginx — without this header, Nginx buffers the
  // response instead of flushing each event as it's written
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  let seq = 0;
 
  try {
    // Stream from your LLM provider
    const stream = await getLLMStream(agentId, messages);
 
    for await (const chunk of stream) {
      if (chunk.type === "token") {
        // SSE format: "id: N\ndata: <payload>\n\n"
        // The id enables reconnection resume via Last-Event-ID
        const ok = res.write(
          `id: ${++seq}\ndata: ${JSON.stringify({
            type: "token",
            content: chunk.content,
            seq,
          })}\n\n`
        );
 
        // Handle backpressure — res.write returns false when the socket
        // buffer is full, so pause until it drains
        if (!ok) {
          await new Promise<void>((resolve) => res.once("drain", resolve));
        }
      }
 
      if (chunk.type === "done") {
        res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      }
    }
  } catch (error) {
    res.write(
      `data: ${JSON.stringify({
        type: "error",
        message: error instanceof Error ? error.message : "Unknown error",
      })}\n\n`
    );
  }
 
  res.end();
});

The buffering traps that silently break SSE

This is where most implementations go wrong. SSE works perfectly in local development, then tokens arrive in bursts when deployed behind a proxy. The culprit is almost always response buffering at one of these layers:

nginx
# Nginx SSE configuration — every directive here matters
location /api/chat/stream {
    proxy_pass http://backend:3000;
 
    # Disable buffering — without this, Nginx holds chunks until its buffer fills
    proxy_buffering off;
    proxy_cache off;
 
    # HTTP/1.1 keepalive for persistent connection
    proxy_http_version 1.1;
    proxy_set_header Connection '';
 
    # Extend timeouts for long-running streams and tool calls
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
 
    # Gzip buffers until it has enough data to compress — kills streaming
    gzip off;
}

The common failure modes and their fixes:

| Problem | Cause | Fix |
| --- | --- | --- |
| Tokens arrive in 500ms batches | proxy_buffering on (Nginx default) | proxy_buffering off |
| Smooth locally, batchy in prod | Gzip compression buffering | gzip off on streaming endpoints |
| Stream dies after 100s of silence | Cloudflare idle timeout | Send ": keepalive\n\n" every 30s |
| Tokens burst after tool calls | ALB 60s idle timeout | Increase timeout or send heartbeats |
| Full response arrives at once | CDN response caching | Cache-Control: no-cache header |

Cloudflare in particular terminates connections that go silent for 100 seconds, which matters for voice agents running long tool calls. Send SSE comment heartbeats during tool execution:

typescript
// Keep Cloudflare alive while a tool call is running
const heartbeat = setInterval(() => {
  if (!res.writableEnded) {
    res.write(": keepalive\n\n");
  }
}, 30_000);
 
try {
  // ... stream tokens, execute tools, etc.
} finally {
  clearInterval(heartbeat);
}

Consuming SSE in the browser

The built-in EventSource only handles GET. For POST (required when you need to send message history or auth headers), use fetch with a streaming body reader:

typescript
async function streamChat(
  agentId: string,
  messages: Array<{ role: string; content: string }>,
  onToken: (content: string) => void,
  // Let the caller pass an AbortSignal so the UI can cancel mid-stream
  signal?: AbortSignal
) {
  const start = Date.now();
  let ttft: number | null = null;
 
  const response = await fetch("/api/chat/stream", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${getToken()}`,
    },
    body: JSON.stringify({ agentId, messages }),
    signal,
  });
 
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
 
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
 
    buffer += decoder.decode(value, { stream: true });
 
    // Events end with a blank line; split on newlines and hold any
    // incomplete trailing line in the buffer for the next chunk
    const lines = buffer.split("\n");
    buffer = lines.pop()!;
 
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
 
      const data = JSON.parse(line.slice(6));
      if (data.type === "done") return;
 
      if (data.type === "token") {
        if (ttft === null) {
          ttft = Date.now() - start;
          console.log(`Time to first token: ${ttft}ms`);
        }
        onToken(data.content);
      }
    }
  }
}

WebSockets: when you need bidirectional real-time

WebSockets open a persistent, full-duplex TCP connection where either side can send messages at any time. The upgrade from HTTP happens once on connect, then all subsequent messages travel over the same persistent socket.

For voice AI, the cases that require WebSockets are specific:

  1. Barge-in / interruption handling. The user starts talking while the agent is still speaking. Your system needs to simultaneously receive that audio, stop TTS playback, cancel the in-flight LLM generation, and re-route to STT, all triggered by a client event that arrives while the server is actively streaming audio back. SSE can't handle this because the client has no channel to send the interruption signal.

  2. Continuous audio streaming. Sending microphone audio in real time requires continuous client-to-server transmission. SSE is server-to-client only.

  3. Multi-turn coordination. Some architectures need the client to send semantic events mid-stream, signaling that a user nodded, confirming a detected intent, or injecting tool results from the client side.

[Figure] WebSocket voice flow: the client streams {type: "audio", chunk: <bytes>} up while the server streams transcript tokens and TTS audio down. When the user starts speaking mid-response, a {type: "barge_in"} event triggers abort() on the LLM and flush() on the TTS engine, and the server replies {type: "listening"} — the barge-in signal travels in the opposite direction to the audio stream.
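Before looking at the server, it helps to pin down the message schema. This is one hypothetical shape for the wire protocol as a discriminated union — the type names mirror the events above, but the exact schema is yours to define:

```typescript
// Hypothetical wire protocol for the voice WebSocket — adjust to your needs
type ClientMessage =
  | { type: "audio_chunk"; data: string }          // base64 audio from the mic
  | { type: "transcript"; agentId: string; transcript: string }
  | { type: "barge_in" };                          // user started speaking

type ServerMessage =
  | { type: "audio"; data: string }                // base64 TTS audio
  | { type: "listening" }
  | { type: "done" }
  | { type: "error"; message: string };

// Narrow an incoming frame; unknown types are rejected, not guessed at
function parseClientMessage(raw: string): ClientMessage | null {
  const msg = JSON.parse(raw);
  if (msg?.type === "barge_in") return msg;
  if (msg?.type === "audio_chunk" && typeof msg.data === "string") return msg;
  if (
    msg?.type === "transcript" &&
    typeof msg.agentId === "string" &&
    typeof msg.transcript === "string"
  ) return msg;
  return null;
}
```

A discriminated union means TypeScript can verify that every handler branch only touches fields that exist for that message type.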

Here's a WebSocket server that handles barge-in:

typescript
import { WebSocketServer, WebSocket } from "ws";
 
const wss = new WebSocketServer({ port: 8080 });
 
wss.on("connection", (ws: WebSocket) => {
  let activeController: AbortController | null = null;
  let isStreaming = false;
 
  ws.on("message", async (raw: Buffer) => {
    const message = JSON.parse(raw.toString());
 
    if (message.type === "barge_in") {
      // User started speaking — immediately cancel ongoing generation
      if (activeController) {
        activeController.abort();
        activeController = null;
        isStreaming = false;
      }
      ws.send(JSON.stringify({ type: "listening" }));
      return;
    }
 
    if (message.type === "audio_chunk") {
      // Route audio to your STT provider (Deepgram, AssemblyAI, etc.)
      await forwardToSTT(message.data);
      return;
    }
 
    if (message.type === "transcript") {
      // Full transcript ready — generate and stream response
      activeController = new AbortController();
      isStreaming = true;
 
      try {
        const stream = await getLLMStream(
          message.agentId,
          message.transcript,
          { signal: activeController.signal }
        );
 
        for await (const chunk of stream) {
          if (!isStreaming) break; // Barge-in may have cleared this flag
 
          if (chunk.type === "token") {
            // Feed tokens to TTS, then stream audio back
            const audioChunk = await synthesize(chunk.content);
            ws.send(
              JSON.stringify({
                type: "audio",
                data: audioChunk,
              })
            );
          }
        }
 
        if (isStreaming) {
          ws.send(JSON.stringify({ type: "done" }));
        }
      } catch (err: any) {
        if (err.name === "AbortError") return; // Expected on barge-in
        ws.send(JSON.stringify({ type: "error", message: err.message }));
      } finally {
        activeController = null;
        isStreaming = false;
      }
    }
  });
 
  ws.on("close", () => {
    activeController?.abort();
  });
});

The decision framework: SSE or WebSockets?

The question isn't which is "better." It's which matches your data flow:

| Criterion | SSE | WebSockets |
| --- | --- | --- |
| Direction | Server → client only | Bidirectional |
| Protocol | Standard HTTP (no upgrade) | HTTP Upgrade handshake to ws:// or wss:// |
| Reconnection | Automatic via EventSource | You implement retry logic |
| Proxy/CDN support | Works everywhere | Needs explicit proxy support |
| Auth | Standard HTTP headers | Query param or first message (browsers can't set custom headers on the upgrade) |
| HTTP/2 multiplexing | Yes — multiple SSE streams over one TCP connection | No — each WebSocket is a separate connection |
| Complexity | Low: standard HTTP semantics | Higher: connection state, heartbeats, reconnection |
| Voice barge-in | Not possible | Native |
| Token streaming | Yes | Yes |

Use SSE when:

  • You're streaming LLM tokens to a chat UI
  • You're pushing notifications, status updates, or analytics events
  • You want to stream agent monitoring events to a dashboard
  • The client sends a request and waits for a streamed response, no events mid-stream

Use WebSockets when:

  • You need barge-in / interruption detection
  • You're streaming raw audio bidirectionally
  • You're building collaborative real-time features where multiple participants send and receive
  • The client needs to send events (not just messages) during an active server stream

For most AI chat products, SSE is the right choice. WebSockets add real complexity: you own reconnection, heartbeat management, and connection state. Don't reach for WebSockets because they feel more "real-time." Reach for them when you genuinely need the client to push events mid-stream.
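If you do reach for WebSockets, the reconnection logic you now own can stay small: capped exponential backoff with jitter. A sketch, with the socket constructor injected so the wrapper stays testable (connectWithRetry and WSLike are illustrative names, not a library API):

```typescript
// Capped exponential backoff with jitter: ~500ms, ~1s, ~2s, ... up to 10s
function backoffDelay(attempt: number, baseMs = 500, capMs = 10_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // jitter in [exp/2, exp)
}

// Minimal socket shape so the wrapper doesn't depend on browser globals
type WSLike = {
  onopen: (() => void) | null;
  onmessage: ((e: { data: string }) => void) | null;
  onclose: (() => void) | null;
  close(): void;
};

// Reconnects until the returned function is called to stop it
function connectWithRetry(
  makeSocket: () => WSLike,
  onMessage: (data: string) => void
) {
  let attempt = 0;
  let closed = false;
  let ws: WSLike;

  const open = () => {
    ws = makeSocket();
    ws.onopen = () => { attempt = 0; };  // reset backoff after a good connect
    ws.onmessage = (e) => onMessage(e.data);
    ws.onclose = () => {
      if (closed) return;
      setTimeout(open, backoffDelay(attempt++));
    };
  };

  open();
  return () => { closed = true; ws.close(); };
}
```

In the browser you'd call connectWithRetry(() => new WebSocket(url), handler). The jitter matters: without it, every client that lost the same server reconnects on the same schedule and stampedes it.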

Building the streaming pipeline correctly

Choosing the right transport is one decision. Actually building the pipeline to use streaming throughout is the harder problem. A common mistake: teams add SSE at the application layer but leave batch processing inside the pipeline stages.

Here's what "streaming throughout" means in practice for a voice agent:

typescript
async function processVoiceTurn(
  audioChunks: AsyncIterable<Buffer>,
  agentId: string,
  ws: WebSocket
): Promise<void> {
  // Stage 1: Stream audio to STT — don't wait for full utterance
  const transcriptStream = await stt.streamTranscribe(audioChunks);
 
  // Stage 2: Start LLM as soon as we have enough context — don't wait for full transcript
  const partialTranscripts: string[] = [];
  let llmStream: AsyncIterable<LLMChunk> | null = null;
 
  for await (const transcript of transcriptStream) {
    partialTranscripts.push(transcript.text);
 
    // Fire LLM on end-of-utterance signal, not end-of-transcript
    if (transcript.isFinal && !llmStream) {
      const fullTranscript = partialTranscripts.join(" ");
      llmStream = getLLMStream(agentId, fullTranscript);
 
      // Stage 3: Start TTS on first LLM token — don't wait for full response
      processLLMToTTS(llmStream, ws).catch(console.error);
    }
  }
}
 
async function processLLMToTTS(
  llmStream: AsyncIterable<LLMChunk>,
  ws: WebSocket
): Promise<void> {
  const ttsStream = tts.createStream();
 
  // Forward TTS audio to the client concurrently with feeding in tokens.
  // If this loop ran after the token loop, audio would only start once
  // the LLM had finished — exactly the serialization we're avoiding.
  const forward = (async () => {
    for await (const audioChunk of ttsStream) {
      ws.send(JSON.stringify({ type: "audio", data: audioChunk.toString("base64") }));
    }
  })();
 
  // Feed LLM tokens into TTS as they arrive
  for await (const chunk of llmStream) {
    if (chunk.type === "token") {
      ttsStream.write(chunk.content);
    }
 
    if (chunk.type === "done") {
      ttsStream.end();
    }
  }
 
  await forward;
}

The key design choice: processLLMToTTS runs concurrently with the transcript loop, not after it. The call is deliberately not awaited inline — it's fired in the background with only a .catch() attached for errors — so the for await over transcriptStream keeps consuming partial transcripts while tokens are already flowing into TTS. That background hand-off is what creates the pipeline overlap.

What actually controls perceived latency

After building and optimizing voice pipelines across different architectures, the dominant factors are:

1. Time to first audio chunk matters more than total generation time. Users experience the gap between finishing their sentence and hearing the first syllable of the response. If that gap is under 400ms, the conversation feels alive. If it's over 800ms, it feels broken, even if the full response arrives 2 seconds later.

2. Pipeline parallelism delivers more than model optimization. Switching from GPT-4o to a faster model might save 30ms on first-token latency. Implementing proper streaming throughout the pipeline typically saves 400-800ms total. Optimize the architecture before optimizing the model selection.

3. Cold starts are the enemy of consistent latency. A system that achieves 280ms P50 but has 2,000ms P99 due to cold containers will feel slow. Maintain warm capacity, implement predictive scaling, and route user-facing traffic away from cold instances. You can see exactly where this is happening with proper agent monitoring.

4. Network topology matters. A 40ms round trip from your user to your server, before any processing, is 40ms you can't recover elsewhere. Edge deployment (6-8 geographic regions rather than one central data center) directly lowers the floor for every request.

5. TTS provider selection has outsized impact. The difference between a TTS provider with 250ms time-to-first-audio-chunk and one with 60ms is larger than the entire first-token latency budget for a fast LLM. Cartesia Sonic Turbo achieves ~40ms TTFB; ElevenLabs Flash is around 75ms. OpenAI's TTS API runs 120-180ms. That gap matters.

Where SSE fits in voice agent monitoring

One underappreciated use of SSE in voice AI systems is operational monitoring: pushing real-time events from your backend to a dashboard as calls happen. This isn't the real-time audio path (which uses WebSockets or WebRTC), but the observability layer sitting alongside it.

When an agent uses a tool, scores poorly on a quality evaluation, or hits an error, you want that signal to surface immediately, not in the next batch report. An SSE stream from your monitoring backend to your dashboard delivers those events without the complexity of a full WebSocket infrastructure, because the dashboard only needs to receive events, not send them.

typescript
// Monitoring SSE endpoint — pushes agent events as they happen
app.get("/api/monitoring/stream/:workspaceId", (req, res) => {
  const { workspaceId } = req.params;
 
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  // Subscribe to events for this workspace
  const unsubscribe = eventBus.subscribe(workspaceId, (event) => {
    res.write(
      `data: ${JSON.stringify({
        type: event.type, // "tool_call", "score_update", "escalation"
        agentId: event.agentId,
        callId: event.callId,
        payload: event.payload,
        ts: event.timestamp,
      })}\n\n`
    );
  });
 
  req.on("close", unsubscribe);
});

This pattern, SSE for monitoring dashboards and WebSockets for real-time audio, is the common production architecture. Each transport does one thing well.

The production checklist

Before shipping a streaming voice AI system, verify each layer:

Transport layer

  • SSE endpoints have proxy_buffering off in Nginx (or equivalent in your proxy)
  • gzip disabled on streaming routes
  • X-Accel-Buffering: no set as response header
  • Timeout configuration reviewed at every hop (proxy, ALB, CDN, Node.js)
  • Heartbeats enabled for connections with long-running tool calls

Pipeline

  • STT streaming enabled, not batch transcription
  • LLM receives partial transcripts (or at least fires on end-of-utterance, not end-of-audio-file)
  • TTS starts synthesizing on first LLM token, not after full response
  • Backpressure handling: drain event on res.write() returning false

Reliability

  • SSE events tagged with sequence numbers for resume-on-reconnect
  • WebSocket reconnection logic implemented with exponential backoff
  • Abort controllers cleaned up on disconnect (to avoid orphaned LLM calls)
  • Warm capacity maintained, no cold starts on P95 user-facing traffic
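The sequence-number item deserves a concrete shape. One minimal approach (ReplayBuffer is an illustrative helper, not a library API): keep the last N events per stream and, when a client reconnects with a Last-Event-ID header, replay everything after that id before resuming the live stream.

```typescript
// Bounded buffer of recent events, keyed by the SSE id field
class ReplayBuffer {
  private events: Array<{ id: number; data: string }> = [];

  constructor(private capacity = 256) {}

  push(id: number, data: string): void {
    this.events.push({ id, data });
    // Evict the oldest event once over capacity
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Everything the client missed after lastEventId (from Last-Event-ID)
  missedSince(lastEventId: number): Array<{ id: number; data: string }> {
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```

On reconnect, parse the Last-Event-ID request header, write the missed frames, then subscribe the response to live events. If the client's id has already been evicted, fall back to a full refresh.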

Observability

  • Time-to-first-token (TTFT) tracked per request
  • P50, P95, P99 latency tracked separately per pipeline stage
  • Error rates on SSE vs WebSocket connections tracked independently
  • Tool call latency tracked within stream events
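For the per-stage percentile items, a nearest-rank helper over raw samples is enough for a dashboard (at high volume you'd switch to a streaming sketch such as t-digest; the sample numbers here are made up):

```typescript
// Nearest-rank percentile: p in [0, 100] over latency samples in ms
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical TTFT samples — one slow outlier from a cold start
const ttftSamples = [120, 95, 310, 140, 2050, 110, 130, 105, 98, 125];

console.log(percentile(ttftSamples, 50)); // P50
console.log(percentile(ttftSamples, 99)); // P99 — exposes the cold-start tail
```

This is why P50 alone is misleading: the median here looks healthy while P99 reveals the 2-second cold-start outlier that users actually notice.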

The choice between SSE and WebSockets resolves cleanly once you're clear about data flow direction. For most AI chat and monitoring use cases, SSE is the right answer. It's simpler, works everywhere, and reconnects automatically. For voice with barge-in, audio streaming, or real-time multi-participant scenarios, WebSockets are necessary.

The harder work is building the pipeline to actually stream throughout, not just at the HTTP layer, but between every stage from STT to LLM to TTS. That's where the 400ms savings live. The transport protocol is the last few milliseconds. Pipeline architecture is the first few hundred.
