Voice AI latency is not just a performance metric. It's a product threshold. When your agent takes longer than 300ms to begin responding, the caller's brain switches from conversation mode to waiting mode. That perceptual shift degrades the interaction even when accuracy is identical. The architecture that determines whether you land on the right side of that threshold is built on two streaming primitives: Server-Sent Events (SSE) and WebSockets. Choosing the wrong one doesn't just add latency. It creates architectural constraints you'll fight for months.
This article explains what each transport actually does, where each one breaks down in real voice scenarios, and how pipeline parallelism makes a bigger difference than raw model speed.
## Why streaming transports matter for voice AI
The core insight is that batch processing and real-time conversation are fundamentally incompatible. Traditional request-response architecture (wait for the full question, process it entirely, generate a complete answer, return it all at once) produces unacceptable latency for voice, because each stage must finish before the next can begin. Streaming transports exist to pipeline these stages so they overlap instead of queue.
Here's how the latency math works without streaming:
| Stage | Naive duration |
|---|---|
| Speech-to-text (batch) | 200-400ms |
| LLM generation (wait for full response) | 1-4 seconds |
| Text-to-speech (full response synthesis) | 500ms-1.5s |
| Network round trips | 40-100ms |
| Total | ~2-6 seconds |
And with pipeline parallelism through streaming:
| Stage | Streaming duration |
|---|---|
| STT (streaming, start on partial audio) | 80-120ms to first transcript |
| LLM (streaming, first token latency) | 100-150ms |
| TTS (streaming, first audio chunk) | 60-100ms |
| Network (concurrent, not sequential) | 20-50ms overhead |
| Total perceived | 260-420ms |
The stages still take the same total time. Streaming doesn't make the models faster. What it does is overlap them. The user starts hearing audio before the LLM has finished generating the full response, because TTS is synthesizing the first few sentences while the LLM is still working on the rest. That's pipeline parallelism, and it's where the 40-60% latency reduction comes from.
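The overlap arithmetic can be sketched directly. The stage timings below are illustrative midpoints of the ranges in the tables above, not measurements:

```typescript
// Illustrative stage timings (ms), roughly the midpoints of the tables above
type Stage = { name: string; batchMs: number; firstChunkMs: number };

const stages: Stage[] = [
  { name: "stt", batchMs: 300, firstChunkMs: 100 },
  { name: "llm", batchMs: 2500, firstChunkMs: 125 },
  { name: "tts", batchMs: 1000, firstChunkMs: 80 },
];

// Sequential: each stage fully finishes before the next begins
const sequentialMs = stages.reduce((sum, s) => sum + s.batchMs, 0);

// Pipelined: perceived latency is the sum of each stage's time to FIRST
// output, because every downstream stage starts on partial input
const pipelinedMs = stages.reduce((sum, s) => sum + s.firstChunkMs, 0);

console.log(sequentialMs, pipelinedMs); // 3800 305
```

The total compute is unchanged; only the waiting stops being serialized.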
## SSE: the right tool for server-to-client streaming
Server-Sent Events are a browser-native protocol for one-directional streaming over standard HTTP. The server opens a persistent connection and pushes events as they're generated. The client listens. No bidirectional channel, no protocol upgrade. Just HTTP with Content-Type: text/event-stream and a persistent keep-alive.
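The wire format itself is small enough to show in full. This `sseEvent` helper is a hypothetical sketch, not part of any library:

```typescript
// Hypothetical helper: serializes one SSE event frame.
// Each field is "name: value" on its own line; a blank line ends the event.
function sseEvent(data: unknown, id?: number): string {
  const idLine = id !== undefined ? `id: ${id}\n` : "";
  return `${idLine}data: ${JSON.stringify(data)}\n\n`;
}

sseEvent({ type: "token", content: "Hi" }, 1);
// → 'id: 1\ndata: {"type":"token","content":"Hi"}\n\n'
```

The optional `id` field is what makes resume-on-reconnect possible: the browser replays the last seen value in a `Last-Event-ID` header when it reconnects.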
For AI text streaming, where the model generates tokens and you want them to appear progressively in the UI, SSE is the default choice. It's what OpenAI, Anthropic, and nearly every AI API uses internally. It works over HTTP/2 (which multiplexes connections), passes through standard reverse proxies, and the browser's EventSource API reconnects automatically on drop.
Here's a minimal SSE server that streams from an LLM:
```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/chat/stream", async (req, res) => {
  const { agentId, messages } = req.body;

  // These three headers establish the SSE connection
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // Critical for Nginx — without this, responses buffer until the gzip threshold
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();

  let seq = 0;
  try {
    // Stream from your LLM provider
    const stream = await getLLMStream(agentId, messages);
    for await (const chunk of stream) {
      if (chunk.type === "token") {
        // SSE format: "id: N\ndata: <payload>\n\n"
        // The id enables reconnection resume via Last-Event-ID
        const ok = res.write(
          `id: ${++seq}\ndata: ${JSON.stringify({
            type: "token",
            content: chunk.content,
            seq,
          })}\n\n`
        );
        // Handle backpressure — if the client can't keep up, wait for drain
        if (!ok) {
          await new Promise<void>((resolve) => res.once("drain", resolve));
        }
      }
      if (chunk.type === "done") {
        res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      }
    }
  } catch (error) {
    res.write(
      `data: ${JSON.stringify({
        type: "error",
        message: error instanceof Error ? error.message : "Unknown error",
      })}\n\n`
    );
  }

  res.end();
});
```

### The buffering traps that silently break SSE
This is where most implementations go wrong. SSE works perfectly in local development, then tokens arrive in bursts when deployed behind a proxy. The culprit is almost always response buffering at one of these layers:
```nginx
# Nginx SSE configuration — every directive here matters
location /api/chat/stream {
    proxy_pass http://backend:3000;

    # Disable buffering — without this, Nginx holds chunks until its buffer fills
    proxy_buffering off;
    proxy_cache off;

    # HTTP/1.1 keepalive for persistent connection
    proxy_http_version 1.1;
    proxy_set_header Connection '';

    # Extend timeouts for long-running streams and tool calls
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    # Gzip buffers until it has enough data to compress — kills streaming
    gzip off;
}
```

The common failure modes and their fixes:
| Problem | Cause | Fix |
|---|---|---|
| Tokens arrive in 500ms batches | `proxy_buffering on` (Nginx default) | `proxy_buffering off` |
| Smooth locally, batchy in prod | Gzip compression buffering | `gzip off` on streaming endpoints |
| Stream dies after 100s of silence | Cloudflare idle timeout | Send `": keepalive\n\n"` every 30s |
| Tokens burst after tool calls | ALB 60s idle timeout | Increase timeout or send heartbeats |
| Full response arrives at once | CDN response caching | `Cache-Control: no-cache` header |
Cloudflare specifically terminates connections that go silent for 100 seconds, which matters for voice agents running long tool calls. Send SSE comment heartbeats during tool execution:
```typescript
// Keep Cloudflare alive while a tool call is running
const heartbeat = setInterval(() => {
  if (!res.writableEnded) {
    res.write(": keepalive\n\n");
  }
}, 30_000);

try {
  // ... stream tokens, execute tools, etc.
} finally {
  clearInterval(heartbeat);
}
```

### Consuming SSE in the browser
The built-in EventSource only handles GET. For POST (required when you need to send message history or auth headers), use fetch with a streaming body reader:
```typescript
async function streamChat(
  agentId: string,
  messages: Array<{ role: string; content: string }>,
  onToken: (content: string) => void,
  // Pass in an AbortController's signal so the UI can cancel mid-stream
  signal?: AbortSignal
) {
  const start = Date.now();
  let ttft: number | null = null;

  const response = await fetch("/api/chat/stream", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${getToken()}`,
    },
    body: JSON.stringify({ agentId, messages }),
    signal,
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Parse complete lines; SSE events end with a blank line
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // Hold incomplete line in buffer
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      if (data.type === "done") return;
      if (data.type === "token") {
        if (!ttft) {
          ttft = Date.now() - start;
          console.log(`Time to first token: ${ttft}ms`);
        }
        onToken(data.content);
      }
    }
  }
}
```

## WebSockets: when you need bidirectional real-time
WebSockets open a persistent, full-duplex TCP connection where either side can send messages at any time. The upgrade from HTTP happens once on connect, then all subsequent messages travel over the same persistent socket.
For voice AI, the cases that require WebSockets are specific:
- **Barge-in / interruption handling.** The user starts talking while the agent is still speaking. Your system needs to simultaneously receive that audio, stop TTS playback, cancel the in-flight LLM generation, and re-route to STT, all triggered by a client event that arrives while the server is actively streaming audio back. SSE can't handle this because the client has no channel to send the interruption signal.
- **Continuous audio streaming.** Sending microphone audio in real time requires continuous client-to-server transmission. SSE is server-to-client only.
- **Multi-turn coordination.** Some architectures need the client to send semantic events mid-stream: signaling that a user nodded, confirming a detected intent, or injecting tool results from the client side.
Here's a WebSocket server that handles barge-in:
```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws: WebSocket) => {
  let activeController: AbortController | null = null;
  let isStreaming = false;

  ws.on("message", async (raw: Buffer) => {
    const message = JSON.parse(raw.toString());

    if (message.type === "barge_in") {
      // User started speaking — immediately cancel ongoing generation
      if (activeController) {
        activeController.abort();
        activeController = null;
        isStreaming = false;
      }
      ws.send(JSON.stringify({ type: "listening" }));
      return;
    }

    if (message.type === "audio_chunk") {
      // Route audio to your STT provider (Deepgram, AssemblyAI, etc.)
      await forwardToSTT(message.data);
      return;
    }

    if (message.type === "transcript") {
      // Full transcript ready — generate and stream response
      activeController = new AbortController();
      isStreaming = true;
      try {
        const stream = await getLLMStream(
          message.agentId,
          message.transcript,
          { signal: activeController.signal }
        );
        for await (const chunk of stream) {
          if (!isStreaming) break; // Barge-in may have cleared this flag
          if (chunk.type === "token") {
            // Feed tokens to TTS, then stream audio back
            const audioChunk = await synthesize(chunk.content);
            ws.send(
              JSON.stringify({
                type: "audio",
                data: audioChunk,
              })
            );
          }
        }
        if (isStreaming) {
          ws.send(JSON.stringify({ type: "done" }));
        }
      } catch (err: any) {
        if (err.name === "AbortError") return; // Expected on barge-in
        ws.send(JSON.stringify({ type: "error", message: err.message }));
      } finally {
        activeController = null;
        isStreaming = false;
      }
    }
  });

  ws.on("close", () => {
    activeController?.abort();
  });
});
```

## The decision framework: SSE or WebSockets?
The question isn't which is "better." It's which matches your data flow:
| Criterion | SSE | WebSockets |
|---|---|---|
| Direction | Server → client only | Bidirectional |
| Protocol | Standard HTTP (no upgrade) | TCP upgrade to ws:// or wss:// |
| Reconnection | Automatic via EventSource | You implement retry logic |
| Proxy/CDN support | Works everywhere | Needs explicit proxy support |
| Auth | Standard HTTP headers | Auth in query param or first message (no headers on upgrade) |
| HTTP/2 multiplexing | Yes, multiple SSE streams over one TCP connection | No, each WebSocket is a separate connection |
| Complexity | Low: standard HTTP semantics | Higher: connection state, heartbeats, reconnection |
| Voice barge-in | Not possible | Native |
| Token streaming | Yes | Yes |
Use SSE when:
- You're streaming LLM tokens to a chat UI
- You're pushing notifications, status updates, or analytics events
- You want to stream agent monitoring events to a dashboard
- The client sends a request and waits for a streamed response, no events mid-stream
Use WebSockets when:
- You need barge-in / interruption detection
- You're streaming raw audio bidirectionally
- You're building collaborative real-time features where multiple participants send and receive
- The client needs to send events (not just messages) during an active server stream
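The two lists above compress to a single predicate. This sketch is our own framing, not a standard API, and the field names are illustrative:

```typescript
// Illustrative decision helper: does the client ever need to push
// data while the server is streaming?
interface Requirements {
  clientSendsEventsMidStream: boolean; // barge-in, mic audio, live signals
  multiParticipant: boolean; // collaborative real-time features
}

function chooseTransport(r: Requirements): "sse" | "websocket" {
  // Anything that requires the client to push during an active server
  // stream rules out SSE, which is server-to-client only
  return r.clientSendsEventsMidStream || r.multiParticipant
    ? "websocket"
    : "sse";
}

// A chat UI that only receives tokens:
chooseTransport({ clientSendsEventsMidStream: false, multiParticipant: false }); // "sse"
// A voice agent with barge-in:
chooseTransport({ clientSendsEventsMidStream: true, multiParticipant: false }); // "websocket"
```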
For most AI chat products, SSE is the right choice. WebSockets add real complexity: you own reconnection, heartbeat management, and connection state. Don't reach for WebSockets because they feel more "real-time." Reach for them when you genuinely need the client to push events mid-stream.
## Building the streaming pipeline correctly
Choosing the right transport is one decision. Actually building the pipeline to use streaming throughout is the harder problem. A common mistake: teams add SSE at the application layer but leave batch processing inside the pipeline stages.
Here's what "streaming throughout" means in practice for a voice agent:
```typescript
async function processVoiceTurn(
  audioChunks: AsyncIterable<Buffer>,
  agentId: string,
  ws: WebSocket
): Promise<void> {
  // Stage 1: Stream audio to STT — don't wait for full utterance
  const transcriptStream = await stt.streamTranscribe(audioChunks);

  // Stage 2: Start LLM as soon as we have enough context — don't wait for full transcript
  const partialTranscripts: string[] = [];
  let llmStream: AsyncIterable<LLMChunk> | null = null;

  for await (const transcript of transcriptStream) {
    partialTranscripts.push(transcript.text);

    // Fire LLM on end-of-utterance signal, not end-of-transcript
    if (transcript.isFinal && !llmStream) {
      const fullTranscript = partialTranscripts.join(" ");
      llmStream = await getLLMStream(agentId, fullTranscript);

      // Stage 3: Start TTS on first LLM token — don't wait for full response
      processLLMToTTS(llmStream, ws).catch(console.error);
    }
  }
}

async function processLLMToTTS(
  llmStream: AsyncIterable<LLMChunk>,
  ws: WebSocket
): Promise<void> {
  const ttsStream = tts.createStream();

  // Feed tokens and forward audio as two concurrent tasks; if forwarding
  // only started after the token loop finished, audio would wait for the
  // full LLM response and the overlap would be lost
  const feed = (async () => {
    for await (const chunk of llmStream) {
      if (chunk.type === "token") ttsStream.write(chunk.content);
      if (chunk.type === "done") ttsStream.end();
    }
  })();

  const forward = (async () => {
    // Forward TTS audio chunks to the client as they synthesize
    for await (const audioChunk of ttsStream) {
      ws.send(JSON.stringify({ type: "audio", data: audioChunk.toString("base64") }));
    }
  })();

  await Promise.all([feed, forward]);
}
```

The key design choice: `processLLMToTTS` runs concurrently with the transcript loop, not after it. We await the LLM stream setup, then fire the downstream chain with a `.catch()` rather than awaiting it inline, so the `for await` over `transcriptStream` keeps draining STT while tokens are already flowing into TTS. The same pattern repeats inside `processLLMToTTS`, where the token-feeding loop and the audio-forwarding loop run as two concurrent tasks. This is what creates the pipeline overlap.
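One refinement worth noting: many streaming TTS engines produce better prosody when fed sentence-sized chunks rather than raw tokens. A minimal buffering sketch follows; the boundary regex is deliberately naive (real segmentation must handle abbreviations, numbers, and trailing punctuation):

```typescript
// Sketch: buffer LLM tokens and emit sentence-sized chunks for TTS.
// The [.!?]-followed-by-whitespace boundary is a simplification.
function createSentenceChunker(emit: (sentence: string) => void) {
  let buffer = "";
  return {
    write(token: string) {
      buffer += token;
      let idx: number;
      // Emit each completed sentence; keep the remainder buffered
      while ((idx = buffer.search(/[.!?]\s/)) !== -1) {
        emit(buffer.slice(0, idx + 1).trim());
        buffer = buffer.slice(idx + 2);
      }
    },
    end() {
      // Flush whatever is left when the LLM stream finishes
      if (buffer.trim()) emit(buffer.trim());
      buffer = "";
    },
  };
}

const out: string[] = [];
const chunker = createSentenceChunker((s) => out.push(s));
for (const t of ["Hel", "lo there. How", " are you?"]) chunker.write(t);
chunker.end();
console.log(out); // ["Hello there.", "How are you?"]
```

Wiring this between the LLM token loop and `ttsStream.write` keeps the first audio chunk early while giving the synthesizer complete clauses to work with.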
## What actually controls perceived latency
After building and optimizing voice pipelines across different architectures, the dominant factors are:
1. Time to first audio chunk matters more than total generation time. Users experience the gap between finishing their sentence and hearing the first syllable of the response. If that gap is under 400ms, the conversation feels alive. If it's over 800ms, it feels broken, even if the full response arrives 2 seconds later.
2. Pipeline parallelism delivers more than model optimization. Switching from GPT-4o to a faster model might save 30ms on first-token latency. Implementing proper streaming throughout the pipeline typically saves 400-800ms total. Optimize the architecture before optimizing the model selection.
3. Cold starts are the enemy of consistent latency. A system that achieves 280ms P50 but has 2,000ms P99 due to cold containers will feel slow. Maintain warm capacity, implement predictive scaling, and route user-facing traffic away from cold instances. You can see exactly where this is happening with proper agent monitoring.
4. Network topology matters. A 40ms round trip from your user to your server, before any processing, is 40ms you can't recover elsewhere. Edge deployment (6-8 geographic regions rather than one central data center) directly lowers the floor for every request.
5. TTS provider selection has outsized impact. The difference between a TTS provider with 250ms time-to-first-audio-chunk and one with 60ms is larger than the entire first-token latency budget for a fast LLM. Cartesia Sonic Turbo achieves ~40ms TTFB; ElevenLabs Flash is around 75ms. OpenAI's TTS API runs 120-180ms. That gap matters.
## Where SSE fits in voice agent monitoring
One underappreciated use of SSE in voice AI systems is operational monitoring: pushing real-time events from your backend to a dashboard as calls happen. This isn't the real-time audio path (which uses WebSockets or WebRTC), but the observability layer sitting alongside it.
When an agent uses a tool, scores poorly on a quality evaluation, or hits an error, you want that signal to surface immediately, not in the next batch report. An SSE stream from your monitoring backend to your dashboard delivers those events without the complexity of a full WebSocket infrastructure, because the dashboard only needs to receive events, not send them.
```typescript
// Monitoring SSE endpoint — pushes agent events as they happen
app.get("/api/monitoring/stream/:workspaceId", (req, res) => {
  const { workspaceId } = req.params;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();

  // Subscribe to events for this workspace
  const unsubscribe = eventBus.subscribe(workspaceId, (event) => {
    res.write(
      `data: ${JSON.stringify({
        type: event.type, // "tool_call", "score_update", "escalation"
        agentId: event.agentId,
        callId: event.callId,
        payload: event.payload,
        ts: event.timestamp,
      })}\n\n`
    );
  });

  req.on("close", unsubscribe);
});
```

This pattern, SSE for monitoring dashboards and WebSockets for real-time audio, is the common production architecture. Each transport does one thing well.
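The endpoint above assumes an `eventBus` with a `subscribe` that returns an unsubscribe function. A minimal in-memory sketch of that shape is below; in production this would typically sit on something like Redis pub/sub so events reach every server instance, not just the one handling the call:

```typescript
// Minimal in-memory pub/sub matching the shape the SSE handler assumes
type AgentEvent = { type: string; [k: string]: unknown };
type Listener = (event: AgentEvent) => void;

function createEventBus() {
  const listeners = new Map<string, Set<Listener>>();
  return {
    subscribe(channel: string, fn: Listener): () => void {
      if (!listeners.has(channel)) listeners.set(channel, new Set());
      listeners.get(channel)!.add(fn);
      // Returning the unsubscribe closure lets the SSE handler pass it
      // straight to req.on("close", ...)
      return () => listeners.get(channel)?.delete(fn);
    },
    publish(channel: string, event: AgentEvent) {
      listeners.get(channel)?.forEach((fn) => fn(event));
    },
  };
}

const bus = createEventBus();
const seen: string[] = [];
const unsub = bus.subscribe("workspace-1", (e) => seen.push(e.type));
bus.publish("workspace-1", { type: "tool_call" });
unsub();
bus.publish("workspace-1", { type: "escalation" }); // not delivered
console.log(seen); // ["tool_call"]
```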
## The production checklist
Before shipping a streaming voice AI system, verify each layer:
### Transport layer

- SSE endpoints have `proxy_buffering off` in Nginx (or the equivalent in your proxy)
- `gzip` disabled on streaming routes
- `X-Accel-Buffering: no` set as a response header
- Timeout configuration reviewed at every hop (proxy, ALB, CDN, Node.js)
- Heartbeats enabled for connections with long-running tool calls
### Pipeline

- STT streaming enabled, not batch transcription
- LLM receives partial transcripts (or at least fires on end-of-utterance, not end-of-audio-file)
- TTS starts synthesizing on first LLM token, not after full response
- Backpressure handled: wait for the `drain` event when `res.write()` returns false
### Reliability
- SSE events tagged with sequence numbers for resume-on-reconnect
- WebSocket reconnection logic implemented with exponential backoff
- Abort controllers cleaned up on disconnect (to avoid orphaned LLM calls)
- Warm capacity maintained, no cold starts on P95 user-facing traffic
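For the reconnection item, a common shape is exponential backoff with full jitter. The base and cap below are illustrative; tune them for your infrastructure:

```typescript
// Exponential backoff with full jitter for WebSocket reconnects.
// Full jitter spreads reconnect storms after a server restart.
function backoffDelay(attempt: number, baseMs = 250, capMs = 15_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp;
}

// attempt 0 → up to 250ms, attempt 3 → up to 2s, attempt 10+ → capped at 15s
```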
### Observability
- Time-to-first-token (TTFT) tracked per request
- P50, P95, P99 latency tracked separately per pipeline stage
- Error rates on SSE vs WebSocket connections tracked independently
- Tool call latency tracked within stream events
The choice between SSE and WebSockets resolves cleanly once you're clear about data flow direction. For most AI chat and monitoring use cases, SSE is the right answer. It's simpler, works everywhere, and reconnects automatically. For voice with barge-in, audio streaming, or real-time multi-participant scenarios, WebSockets are necessary.
The harder work is building the pipeline to actually stream throughout, not just at the HTTP layer, but between every stage from STT to LLM to TTS. That's where the 400ms savings live. The transport protocol is the last few milliseconds. Pipeline architecture is the first few hundred.
## Sources

- MDN Web Docs: Server-Sent Events. Browser `EventSource` API reference, reconnection behavior, and SSE event format specification.
- WHATWG: HTML Living Standard, Server-Sent Events. The SSE protocol specification, including `Last-Event-ID` reconnection semantics.
- RFC 6455: The WebSocket Protocol. Full WebSocket specification covering the upgrade handshake, framing, and connection lifecycle.
- Nginx Documentation: `ngx_http_proxy_module`, `proxy_buffering`. Configuration reference for the directives that make or break SSE passthrough.
- Node.js: Backpressuring in Streams. Official guide to `drain` event handling and writable stream backpressure.
- Deepgram: Streaming Speech Recognition. Real-time STT API reference including partial transcript events and endpointing configuration.
- Anthropic API: Messages Streaming. Claude streaming event types: `content_block_delta`, `message_delta`, `message_stop`.
- Cloudflare: Timeouts. Documented proxy timeout limits including the 100-second idle timeout that kills silent SSE streams.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.