Voice AI latency is not just a performance metric. It's a product threshold. When your agent takes longer than 300ms to begin responding, the caller's brain switches from conversation mode to waiting mode. That perceptual shift degrades the interaction even when accuracy is identical. The architecture that determines whether you land on the right side of that threshold is built on two streaming primitives: Server-Sent Events (SSE) and WebSockets. Choosing the wrong one doesn't just add latency. It creates architectural constraints you'll fight for months.
This article explains what each transport actually does, where each one breaks down in real voice scenarios, and how pipeline parallelism makes a bigger difference than raw model speed.
## Why streaming transports matter for voice AI
The core insight is that batch processing and real-time conversation are fundamentally incompatible. Traditional request-response architecture (wait for the full question, process it entirely, generate a complete answer, return it all at once) produces unacceptable latency for voice, because each stage must finish before the next can begin. Streaming transports exist to pipeline these stages so they overlap instead of queue.
Here's how the latency math works without streaming:
| Stage | Naive duration |
|---|---|
| Speech-to-text (batch) | 200-400ms |
| LLM generation (wait for full response) | 1-4 seconds |
| Text-to-speech (full response synthesis) | 500ms-1.5s |
| Network round trips | 40-100ms |
| Total | ~2-6 seconds |
And with pipeline parallelism through streaming:
| Stage | Streaming duration |
|---|---|
| STT (streaming, start on partial audio) | 80-120ms to first transcript |
| LLM (streaming, first token latency) | 100-150ms |
| TTS (streaming, first audio chunk) | 60-100ms |
| Network (concurrent, not sequential) | 20-50ms overhead |
| Total perceived | 260-420ms |
The stages still take the same total time. Streaming doesn't make the models faster. What it does is overlap them. The user starts hearing audio before the LLM has finished generating the full response, because TTS is synthesizing the first few sentences while the LLM is still working on the rest. That's pipeline parallelism, and it's where the 40-60% latency reduction comes from.
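The overlap arithmetic can be sketched directly. The stage timings below are illustrative midpoints of the ranges in the tables above, not measurements:

```typescript
// Illustrative stage timings (ms), roughly the midpoints of the tables above
type Stage = { name: string; batchMs: number; firstChunkMs: number };

const stages: Stage[] = [
  { name: "stt", batchMs: 300, firstChunkMs: 100 },
  { name: "llm", batchMs: 2500, firstChunkMs: 125 },
  { name: "tts", batchMs: 1000, firstChunkMs: 80 },
];

// Sequential: each stage fully finishes before the next begins
const sequentialMs = stages.reduce((sum, s) => sum + s.batchMs, 0);

// Pipelined: perceived latency is the sum of each stage's time to FIRST
// output, because every downstream stage starts on partial input
const pipelinedMs = stages.reduce((sum, s) => sum + s.firstChunkMs, 0);

console.log(sequentialMs, pipelinedMs); // 3800 305
```

The total compute is unchanged; only the waiting stops being serialized.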
## SSE: the right tool for server-to-client streaming
Server-Sent Events are a browser-native protocol for one-directional streaming over standard HTTP. The server opens a persistent connection and pushes events as they're generated. The client listens. No bidirectional channel, no protocol upgrade. Just HTTP with Content-Type: text/event-stream and a persistent keep-alive.
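The wire format itself is small enough to show in full. This `sseEvent` helper is a hypothetical sketch, not part of any library:

```typescript
// Hypothetical helper: serializes one SSE event frame.
// Each field is "name: value" on its own line; a blank line ends the event.
function sseEvent(data: unknown, id?: number): string {
  const idLine = id !== undefined ? `id: ${id}\n` : "";
  return `${idLine}data: ${JSON.stringify(data)}\n\n`;
}

sseEvent({ type: "token", content: "Hi" }, 1);
// → 'id: 1\ndata: {"type":"token","content":"Hi"}\n\n'
```

The optional `id` field is what makes resume-on-reconnect possible: the browser replays the last seen value in a `Last-Event-ID` header when it reconnects.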
For AI text streaming, where the model generates tokens and you want them to appear progressively in the UI, SSE is the default choice. It's what OpenAI, Anthropic, and nearly every AI API uses internally. It works over HTTP/2 (which multiplexes connections), passes through standard reverse proxies, and the browser's EventSource API reconnects automatically on drop.
Here's a minimal SSE server that streams from an LLM:
```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/chat/stream", async (req, res) => {
  const { agentId, messages } = req.body;

  // These three headers establish the SSE connection
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // Critical for Nginx — without this, responses buffer until the gzip threshold
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();

  let seq = 0;
  try {
    // Stream from your LLM provider
    const stream = await getLLMStream(agentId, messages);
    for await (const chunk of stream) {
      if (chunk.type === "token") {
        // SSE format: "id: N\ndata: <payload>\n\n"
        // The id enables reconnection resume via Last-Event-ID
        const ok = res.write(
          `id: ${++seq}\ndata: ${JSON.stringify({
            type: "token",
            content: chunk.content,
            seq,
          })}\n\n`
        );
        // Handle backpressure — if the client can't keep up, wait for drain
        if (!ok) {
          await new Promise<void>((resolve) => res.once("drain", resolve));
        }
      }
      if (chunk.type === "done") {
        res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      }
    }
  } catch (error) {
    res.write(
      `data: ${JSON.stringify({
        type: "error",
        message: error instanceof Error ? error.message : "Unknown error",
      })}\n\n`
    );
  }

  res.end();
});
```

### The buffering traps that silently break SSE
This is where most implementations go wrong. SSE works perfectly in local development, then tokens arrive in bursts when deployed behind a proxy. The culprit is almost always response buffering at one of these layers:
```nginx
# Nginx SSE configuration — every directive here matters
location /api/chat/stream {
    proxy_pass http://backend:3000;

    # Disable buffering — without this, Nginx holds chunks until its buffer fills
    proxy_buffering off;
    proxy_cache off;

    # HTTP/1.1 keepalive for persistent connection
    proxy_http_version 1.1;
    proxy_set_header Connection '';

    # Extend timeouts for long-running streams and tool calls
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    # Gzip buffers until it has enough data to compress — kills streaming
    gzip off;
}
```

The common failure modes and their fixes:
| Problem | Cause | Fix |
|---|---|---|
| Tokens arrive in 500ms batches | `proxy_buffering on` (Nginx default) | `proxy_buffering off` |
| Smooth locally, batchy in prod | Gzip compression buffering | `gzip off` on streaming endpoints |
| Stream dies after 100s of silence | Cloudflare idle timeout | Send `": keepalive\n\n"` every 30s |
| Tokens burst after tool calls | ALB 60s idle timeout | Increase timeout or send heartbeats |
| Full response arrives at once | CDN response caching | `Cache-Control: no-cache` header |
Cloudflare specifically terminates connections that go silent for 100 seconds, which matters for voice agents running long tool calls. Send SSE comment heartbeats during tool execution:
```typescript
// Keep Cloudflare alive while a tool call is running
const heartbeat = setInterval(() => {
  if (!res.writableEnded) {
    res.write(": keepalive\n\n");
  }
}, 30_000);

try {
  // ... stream tokens, execute tools, etc.
} finally {
  clearInterval(heartbeat);
}
```

### Consuming SSE in the browser
The built-in EventSource only handles GET. For POST (required when you need to send message history or auth headers), use fetch with a streaming body reader:
```typescript
async function streamChat(
  agentId: string,
  messages: Array<{ role: string; content: string }>,
  onToken: (content: string) => void,
  // Pass in an AbortController's signal so the UI can cancel mid-stream
  signal?: AbortSignal
) {
  const start = Date.now();
  let ttft: number | null = null;

  const response = await fetch("/api/chat/stream", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${getToken()}`,
    },
    body: JSON.stringify({ agentId, messages }),
    signal,
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Parse complete lines; SSE events end with a blank line
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // Hold incomplete line in buffer
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      if (data.type === "done") return;
      if (data.type === "token") {
        if (!ttft) {
          ttft = Date.now() - start;
          console.log(`Time to first token: ${ttft}ms`);
        }
        onToken(data.content);
      }
    }
  }
}
```

## WebSockets: when you need bidirectional real-time
WebSockets open a persistent, full-duplex TCP connection where either side can send messages at any time. The upgrade from HTTP happens once on connect, then all subsequent messages travel over the same persistent socket.
For voice AI, the cases that require WebSockets are specific:
- **Barge-in / interruption handling.** The user starts talking while the agent is still speaking. Your system needs to simultaneously receive that audio, stop TTS playback, cancel the in-flight LLM generation, and re-route to STT, all triggered by a client event that arrives while the server is actively streaming audio back. SSE can't handle this because the client has no channel to send the interruption signal.
- **Continuous audio streaming.** Sending microphone audio in real time requires continuous client-to-server transmission. SSE is server-to-client only.
- **Multi-turn coordination.** Some architectures need the client to send semantic events mid-stream: signaling that a user nodded, confirming a detected intent, or injecting tool results from the client side.
Here's a WebSocket server that handles barge-in:
```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws: WebSocket) => {
  let activeController: AbortController | null = null;
  let isStreaming = false;

  ws.on("message", async (raw: Buffer) => {
    const message = JSON.parse(raw.toString());

    if (message.type === "barge_in") {
      // User started speaking — immediately cancel ongoing generation
      if (activeController) {
        activeController.abort();
        activeController = null;
        isStreaming = false;
      }
      ws.send(JSON.stringify({ type: "listening" }));
      return;
    }

    if (message.type === "audio_chunk") {
      // Route audio to your STT provider (Deepgram, AssemblyAI, etc.)
      await forwardToSTT(message.data);
      return;
    }

    if (message.type === "transcript") {
      // Full transcript ready — generate and stream response
      activeController = new AbortController();
      isStreaming = true;
      try {
        const stream = await getLLMStream(
          message.agentId,
          message.transcript,
          { signal: activeController.signal }
        );
        for await (const chunk of stream) {
          if (!isStreaming) break; // Barge-in may have cleared this flag
          if (chunk.type === "token") {
            // Feed tokens to TTS, then stream audio back
            const audioChunk = await synthesize(chunk.content);
            ws.send(
              JSON.stringify({
                type: "audio",
                data: audioChunk,
              })
            );
          }
        }
        if (isStreaming) {
          ws.send(JSON.stringify({ type: "done" }));
        }
      } catch (err: any) {
        if (err.name === "AbortError") return; // Expected on barge-in
        ws.send(JSON.stringify({ type: "error", message: err.message }));
      } finally {
        activeController = null;
        isStreaming = false;
      }
    }
  });

  ws.on("close", () => {
    activeController?.abort();
  });
});
```

## The decision framework: SSE or WebSockets?
The question isn't which is "better." It's which matches your data flow:
| Criterion | SSE | WebSockets |
|---|---|---|
| Direction | Server → client only | Bidirectional |
| Protocol | Standard HTTP (no upgrade) | TCP upgrade to ws:// or wss:// |
| Reconnection | Automatic via EventSource | You implement retry logic |
| Proxy/CDN support | Works everywhere | Needs explicit proxy support |
| Auth | Standard HTTP headers | Auth in query param or first message (no headers on upgrade) |
| HTTP/2 multiplexing | Yes, multiple SSE streams over one TCP connection | No, each WebSocket is a separate connection |
| Complexity | Low: standard HTTP semantics | Higher: connection state, heartbeats, reconnection |
| Voice barge-in | Not possible | Native |
| Token streaming | Yes | Yes |
Use SSE when:
- You're streaming LLM tokens to a chat UI
- You're pushing notifications, status updates, or analytics events
- You want to stream agent monitoring events to a dashboard
- The client sends a request and waits for a streamed response, no events mid-stream
Use WebSockets when:
- You need barge-in / interruption detection
- You're streaming raw audio bidirectionally
- You're building collaborative real-time features where multiple participants send and receive
- The client needs to send events (not just messages) during an active server stream
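The two lists above compress to a single predicate. This sketch is our own framing, not a standard API, and the field names are illustrative:

```typescript
// Illustrative decision helper: does the client ever need to push
// data while the server is streaming?
interface Requirements {
  clientSendsEventsMidStream: boolean; // barge-in, mic audio, live signals
  multiParticipant: boolean; // collaborative real-time features
}

function chooseTransport(r: Requirements): "sse" | "websocket" {
  // Anything that requires the client to push during an active server
  // stream rules out SSE, which is server-to-client only
  return r.clientSendsEventsMidStream || r.multiParticipant
    ? "websocket"
    : "sse";
}

// A chat UI that only receives tokens:
chooseTransport({ clientSendsEventsMidStream: false, multiParticipant: false }); // "sse"
// A voice agent with barge-in:
chooseTransport({ clientSendsEventsMidStream: true, multiParticipant: false }); // "websocket"
```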
For most AI chat products, SSE is the right choice. WebSockets add real complexity: you own reconnection, heartbeat management, and connection state. Don't reach for WebSockets because they feel more "real-time." Reach for them when you genuinely need the client to push events mid-stream.
## Building the streaming pipeline correctly
Choosing the right transport is one decision. Actually building the pipeline to use streaming throughout is the harder problem. A common mistake: teams add SSE at the application layer but leave batch processing inside the pipeline stages.
Here's what "streaming throughout" means in practice for a voice agent:
```typescript
async function processVoiceTurn(
  audioChunks: AsyncIterable<Buffer>,
  agentId: string,
  ws: WebSocket
): Promise<void> {
  // Stage 1: Stream audio to STT — don't wait for full utterance
  const transcriptStream = await stt.streamTranscribe(audioChunks);

  // Stage 2: Start LLM as soon as we have enough context — don't wait for full transcript
  const partialTranscripts: string[] = [];
  let llmStream: AsyncIterable<LLMChunk> | null = null;

  for await (const transcript of transcriptStream) {
    partialTranscripts.push(transcript.text);

    // Fire LLM on end-of-utterance signal, not end-of-transcript
    if (transcript.isFinal && !llmStream) {
      const fullTranscript = partialTranscripts.join(" ");
      llmStream = await getLLMStream(agentId, fullTranscript);

      // Stage 3: Start TTS on first LLM token — don't wait for full response
      processLLMToTTS(llmStream, ws).catch(console.error);
    }
  }
}

async function processLLMToTTS(
  llmStream: AsyncIterable<LLMChunk>,
  ws: WebSocket
): Promise<void> {
  const ttsStream = tts.createStream();

  // Feed tokens and forward audio as two concurrent tasks; if forwarding
  // only started after the token loop finished, audio would wait for the
  // full LLM response and the overlap would be lost
  const feed = (async () => {
    for await (const chunk of llmStream) {
      if (chunk.type === "token") ttsStream.write(chunk.content);
      if (chunk.type === "done") ttsStream.end();
    }
  })();

  const forward = (async () => {
    // Forward TTS audio chunks to the client as they synthesize
    for await (const audioChunk of ttsStream) {
      ws.send(JSON.stringify({ type: "audio", data: audioChunk.toString("base64") }));
    }
  })();

  await Promise.all([feed, forward]);
}
```

The key design choice: `processLLMToTTS` runs concurrently with the transcript loop, not after it. We await the LLM stream setup, then fire the downstream chain with a `.catch()` rather than awaiting it inline, so the `for await` over `transcriptStream` keeps draining STT while tokens are already flowing into TTS. The same pattern repeats inside `processLLMToTTS`, where the token-feeding loop and the audio-forwarding loop run as two concurrent tasks. This is what creates the pipeline overlap.
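One refinement worth noting: many streaming TTS engines produce better prosody when fed sentence-sized chunks rather than raw tokens. A minimal buffering sketch follows; the boundary regex is deliberately naive (real segmentation must handle abbreviations, numbers, and trailing punctuation):

```typescript
// Sketch: buffer LLM tokens and emit sentence-sized chunks for TTS.
// The [.!?]-followed-by-whitespace boundary is a simplification.
function createSentenceChunker(emit: (sentence: string) => void) {
  let buffer = "";
  return {
    write(token: string) {
      buffer += token;
      let idx: number;
      // Emit each completed sentence; keep the remainder buffered
      while ((idx = buffer.search(/[.!?]\s/)) !== -1) {
        emit(buffer.slice(0, idx + 1).trim());
        buffer = buffer.slice(idx + 2);
      }
    },
    end() {
      // Flush whatever is left when the LLM stream finishes
      if (buffer.trim()) emit(buffer.trim());
      buffer = "";
    },
  };
}

const out: string[] = [];
const chunker = createSentenceChunker((s) => out.push(s));
for (const t of ["Hel", "lo there. How", " are you?"]) chunker.write(t);
chunker.end();
console.log(out); // ["Hello there.", "How are you?"]
```

Wiring this between the LLM token loop and `ttsStream.write` keeps the first audio chunk early while giving the synthesizer complete clauses to work with.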
## What actually controls perceived latency
After building and optimizing voice pipelines across different architectures, the dominant factors are:
1. Time to first audio chunk matters more than total generation time. Users experience the gap between finishing their sentence and hearing the first syllable of the response. If that gap is under 400ms, the conversation feels alive. If it's over 800ms, it feels broken, even if the full response arrives 2 seconds later.
2. Pipeline parallelism delivers more than model optimization. Switching from GPT-4o to a faster model might save 30ms on first-token latency. Implementing proper streaming throughout the pipeline typically saves 400-800ms total. Optimize the architecture before optimizing the model selection.
3. Cold starts are the enemy of consistent latency. A system that achieves 280ms P50 but has 2,000ms P99 due to cold containers will feel slow. Maintain warm capacity, implement predictive scaling, and route user-facing traffic away from cold instances. You can see exactly where this is happening with proper agent monitoring.
4. Network topology matters. A 40ms round trip from your user to your server, before any processing, is 40ms you can't recover elsewhere. Edge deployment (6-8 geographic regions rather than one central data center) directly lowers the floor for every request.
5. TTS provider selection has outsized impact. The difference between a TTS provider with 250ms time-to-first-audio-chunk and one with 60ms is larger than the entire first-token latency budget for a fast LLM. Cartesia Sonic Turbo achieves ~40ms TTFB; ElevenLabs Flash is around 75ms. OpenAI's TTS API runs 120-180ms. That gap matters.
## Where SSE fits in voice agent monitoring
One underappreciated use of SSE in voice AI systems is operational monitoring: pushing real-time events from your backend to a dashboard as calls happen. This isn't the real-time audio path (which uses WebSockets or WebRTC), but the observability layer sitting alongside it.
When an agent uses a tool, scores poorly on a quality evaluation, or hits an error, you want that signal to surface immediately, not in the next batch report. An SSE stream from your monitoring backend to your dashboard delivers those events without the complexity of a full WebSocket infrastructure, because the dashboard only needs to receive events, not send them.
```typescript
// Monitoring SSE endpoint — pushes agent events as they happen
app.get("/api/monitoring/stream/:workspaceId", (req, res) => {
  const { workspaceId } = req.params;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();

  // Subscribe to events for this workspace
  const unsubscribe = eventBus.subscribe(workspaceId, (event) => {
    res.write(
      `data: ${JSON.stringify({
        type: event.type, // "tool_call", "score_update", "escalation"
        agentId: event.agentId,
        callId: event.callId,
        payload: event.payload,
        ts: event.timestamp,
      })}\n\n`
    );
  });

  req.on("close", unsubscribe);
});
```

This pattern, SSE for monitoring dashboards and WebSockets for real-time audio, is the common production architecture. Each transport does one thing well.
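The endpoint above assumes an `eventBus` with a `subscribe` that returns an unsubscribe function. A minimal in-memory sketch of that shape is below; in production this would typically sit on something like Redis pub/sub so events reach every server instance, not just the one handling the call:

```typescript
// Minimal in-memory pub/sub matching the shape the SSE handler assumes
type AgentEvent = { type: string; [k: string]: unknown };
type Listener = (event: AgentEvent) => void;

function createEventBus() {
  const listeners = new Map<string, Set<Listener>>();
  return {
    subscribe(channel: string, fn: Listener): () => void {
      if (!listeners.has(channel)) listeners.set(channel, new Set());
      listeners.get(channel)!.add(fn);
      // Returning the unsubscribe closure lets the SSE handler pass it
      // straight to req.on("close", ...)
      return () => listeners.get(channel)?.delete(fn);
    },
    publish(channel: string, event: AgentEvent) {
      listeners.get(channel)?.forEach((fn) => fn(event));
    },
  };
}

const bus = createEventBus();
const seen: string[] = [];
const unsub = bus.subscribe("workspace-1", (e) => seen.push(e.type));
bus.publish("workspace-1", { type: "tool_call" });
unsub();
bus.publish("workspace-1", { type: "escalation" }); // not delivered
console.log(seen); // ["tool_call"]
```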
## The production checklist
Before shipping a streaming voice AI system, verify each layer:
### Transport layer

- SSE endpoints have `proxy_buffering off` in Nginx (or the equivalent in your proxy)
- `gzip` disabled on streaming routes
- `X-Accel-Buffering: no` set as a response header
- Timeout configuration reviewed at every hop (proxy, ALB, CDN, Node.js)
- Heartbeats enabled for connections with long-running tool calls
### Pipeline

- STT streaming enabled, not batch transcription
- LLM receives partial transcripts (or at least fires on end-of-utterance, not end-of-audio-file)
- TTS starts synthesizing on first LLM token, not after full response
- Backpressure handled: wait for the `drain` event when `res.write()` returns false
### Reliability
- SSE events tagged with sequence numbers for resume-on-reconnect
- WebSocket reconnection logic implemented with exponential backoff
- Abort controllers cleaned up on disconnect (to avoid orphaned LLM calls)
- Warm capacity maintained, no cold starts on P95 user-facing traffic
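For the reconnection item, a common shape is exponential backoff with full jitter. The base and cap below are illustrative; tune them for your infrastructure:

```typescript
// Exponential backoff with full jitter for WebSocket reconnects.
// Full jitter spreads reconnect storms after a server restart.
function backoffDelay(attempt: number, baseMs = 250, capMs = 15_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp;
}

// attempt 0 → up to 250ms, attempt 3 → up to 2s, attempt 10+ → capped at 15s
```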
### Observability
- Time-to-first-token (TTFT) tracked per request
- P50, P95, P99 latency tracked separately per pipeline stage
- Error rates on SSE vs WebSocket connections tracked independently
- Tool call latency tracked within stream events
The choice between SSE and WebSockets resolves cleanly once you're clear about data flow direction. For most AI chat and monitoring use cases, SSE is the right answer. It's simpler, works everywhere, and reconnects automatically. For voice with barge-in, audio streaming, or real-time multi-participant scenarios, WebSockets are necessary.
The harder work is building the pipeline to actually stream throughout, not just at the HTTP layer, but between every stage from STT to LLM to TTS. That's where the 400ms savings live. The transport protocol is the last few milliseconds. Pipeline architecture is the first few hundred.
## Sources

- MDN Web Docs: Server-Sent Events. Browser `EventSource` API reference, reconnection behavior, and SSE event format specification.
- WHATWG: HTML Living Standard, Server-Sent Events. The SSE protocol specification, including `Last-Event-ID` reconnection semantics.
- RFC 6455: The WebSocket Protocol. Full WebSocket specification covering the upgrade handshake, framing, and connection lifecycle.
- Nginx Documentation: `ngx_http_proxy_module`, `proxy_buffering`. Configuration reference for the directives that make or break SSE passthrough.
- Node.js: Backpressuring in Streams. Official guide to `drain` event handling and writable stream backpressure.
- Deepgram: Streaming Speech Recognition. Real-time STT API reference including partial transcript events and endpointing configuration.
- Anthropic API: Messages Streaming. Claude streaming event types: `content_block_delta`, `message_delta`, `message_stop`.
- Cloudflare: Timeouts. Documented proxy timeout limits including the 100-second idle timeout that kills silent SSE streams.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.