Our voice agent took 1.4 seconds to respond. Users described it as "talking to someone on a bad phone connection." We traced the problem through every layer: 280ms in speech-to-text, 600ms waiting for the LLM to start generating, 320ms before the first audio reached the speaker, and 200ms of transport overhead nobody had measured. The fix wasn't faster models. It was understanding how the pipeline actually works and where every millisecond goes.
This article builds a real-time voice AI pipeline from scratch. We'll start with a single audio frame and follow it through speech recognition, language model inference, and speech synthesis. You'll see exactly where latency hides, how streaming eliminates the biggest bottleneck, and how to get the whole thing under 300ms.
| What you'll build | What you'll learn |
|---|---|
| Streaming voice pipeline | How STT, LLM, and TTS run concurrently through frames |
| Latency budget breakdown | Where every millisecond goes and how to measure it |
| Smart turn detection | VAD vs semantic end-of-turn and why it matters |
| Interruption handler | Stopping mid-sentence when the user cuts in |
| Transport layer | Why WebRTC saves 700ms over phone calls |
| Framework comparison | Pipecat vs LiveKit vs VAPI for different use cases |
What you'll need
Runtime:
- Python 3.11+ (Pipecat is Python-native)
- API keys for Deepgram (STT), OpenAI (LLM), and Cartesia or ElevenLabs (TTS)
Install dependencies:
```bash
pip install "pipecat-ai[deepgram,openai,cartesia,daily,silero]"
```

Set your keys:

```bash
export DEEPGRAM_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export CARTESIA_API_KEY="your-key"
```

What does a voice AI pipeline actually do?
A voice AI pipeline converts spoken audio to text, reasons about a response, and converts that response back to audio, all in real time. The entire round trip needs to happen in under 300ms to feel like a natural conversation. The trick is that these three stages don't run one after the other. They stream concurrently.
Here's the mental model. A human conversation has a natural rhythm. One person speaks, there's a brief pause (typically 200-300ms), and the other person responds. If that pause stretches to 600ms, the conversation feels sluggish. Past a full second, it feels broken. Your AI agent is held to the same standard.
The pipeline has three core stages, with a transport layer on each end.
Without streaming, each stage waits for the previous one to finish completely. The user speaks for 3 seconds, STT processes the full utterance (280ms), the LLM generates the complete response (2-4 seconds), and TTS synthesizes all the audio (500ms). Total response time: 3+ seconds after the user stops talking. Nobody would use this.
With streaming, STT emits partial transcripts while the user is still speaking. The LLM starts generating tokens as soon as it has enough context. TTS converts each token chunk to audio the moment it arrives. The user hears the first word of the response within 300ms of finishing their sentence, while the rest of the response is still being generated.
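To see why streaming wins, run the arithmetic. A rough sketch, using the stage timings quoted in this article as assumptions:

```python
# Back-of-the-envelope comparison of the two architectures, measured from
# the moment the user stops speaking. Stage timings (ms) are assumptions
# taken from the figures in this article.
STT_FULL = 280      # transcribe the complete utterance
LLM_FULL = 2500     # generate the complete response (midpoint of 2-4s)
TTS_FULL = 500      # synthesize all of the audio

STT_FINALIZE = 80   # streaming: confirm the final transcript
LLM_TTFT = 150      # streaming: time to first token
TTS_TTFB = 60       # streaming: time to first audio byte

# Sequential: each stage waits for the previous one to finish completely.
sequential_ms = STT_FULL + LLM_FULL + TTS_FULL      # 3280

# Streaming: the user hears audio after the first-byte latencies only.
streaming_ms = STT_FINALIZE + LLM_TTFT + TTS_TTFB   # 290

print(sequential_ms, streaming_ms)
```

Same models, same providers; the order-of-magnitude difference comes entirely from overlapping the stages.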
That's the architecture. Let's build it.
How does frame-based streaming work?
Frame-based streaming treats audio and text as a continuous river of small typed objects called frames. Each processor in the pipeline consumes one frame type, does its work, and emits the next frame type downstream. Audio frames carry 20ms of PCM samples. Transcription frames carry text. LLM frames carry tokens. This design lets every stage run concurrently on its own async task.
Pipecat's frame model is the foundation of everything else. If you have built data pipelines with tools like Apache Kafka or Unix pipes, the concept is familiar: small messages flowing through a chain of processors.
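It helps to make "frame" concrete. At the 16kHz, 16-bit mono format used in the examples below, a 20ms frame is a small fixed-size chunk:

```python
# Size of one 20ms audio frame at 16kHz, 16-bit mono PCM (the format the
# pipeline examples in this article configure).
SAMPLE_RATE = 16_000  # samples per second
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
FRAME_MS = 20         # duration of one AudioRawFrame

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples
bytes_per_frame = samples_per_frame * SAMPLE_WIDTH  # 640 bytes
frames_per_second = 1000 // FRAME_MS                # 50 frames per second

print(samples_per_frame, bytes_per_frame, frames_per_second)
```

Fifty 640-byte messages per second per direction: small enough to process with negligible per-frame overhead, frequent enough that interruptions can take effect within one frame.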
Here are the core frame types you'll work with:
```python
from pipecat.frames.frames import (
    AudioRawFrame,              # 20ms of PCM audio samples
    TranscriptionFrame,         # Final transcript from STT
    InterimTranscriptionFrame,  # Partial transcript (still speaking)
    TextFrame,                  # LLM output token
    TTSAudioRawFrame,           # Synthesized audio from TTS
    StartInterruptionFrame,     # User started speaking (cancel current output)
    UserStartedSpeakingFrame,   # VAD detected speech onset
    UserStoppedSpeakingFrame,   # VAD detected silence after speech
    EndFrame,                   # Pipeline shutdown signal
)
```

Each processor is a Python class that receives frames, processes them, and pushes new frames downstream. Here's a minimal custom processor that logs every transcript before passing it through:
```python
from pipecat.frames.frames import TranscriptionFrame
from pipecat.processors.frame_processor import FrameProcessor, FrameDirection

class TranscriptLogger(FrameProcessor):
    """Logs every final transcript passing through the pipeline."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"[USER] {frame.text}")
        # Always push the frame downstream so the pipeline continues
        await self.push_frame(frame, direction)
```

Frames also flow in two directions. Downstream frames (audio in, transcripts, LLM tokens, TTS audio) flow left to right through the pipeline. Upstream frames (interruptions, control signals) flow right to left, telling earlier stages to cancel their current work. This bidirectional flow is what makes interruption handling possible without complex callback spaghetti.
Building the pipeline in Pipecat
Now we assemble the full pipeline. Pipecat uses a Pipeline class that chains processors together. Each processor handles one responsibility: capturing audio, transcribing speech, running the LLM, synthesizing audio, and playing it back.
The pipeline definition is declarative. You list the processors in order and Pipecat wires the frame routing automatically:
```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

async def main():
    # Transport: WebRTC via Daily
    transport = DailyTransport(
        room_url="https://your-domain.daily.co/your-room",
        token="your-daily-token",
        bot_name="Voice Agent",
        params=DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(
                stop_secs=0.3,  # Silence duration before end-of-turn
                min_volume=0.1,
            )),
            vad_audio_passthrough=True,
        ),
    )

    # Stage 1: Speech-to-Text
    stt = DeepgramSTTService(
        api_key="your-deepgram-key",
        sample_rate=16000,
        settings=DeepgramSTTService.Settings(
            model="nova-3",  # Deepgram's fastest model
            language="en",
        ),
    )

    # Stage 2: Language Model
    llm = OpenAILLMService(
        api_key="your-openai-key",
        model="gpt-4o-mini",  # Fast inference, good enough for voice
    )

    # Stage 3: Text-to-Speech
    tts = CartesiaTTSService(
        api_key="your-cartesia-key",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model="sonic-2",  # Sub-100ms first-byte latency
        sample_rate=16000,
    )

    # System prompt: defines the agent's personality and constraints
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful voice assistant for a software company. "
                "Keep responses concise - under 3 sentences for simple questions. "
                "Use natural conversational language, not formal writing. "
                "If you don't know something, say so directly."
            ),
        }
    ]

    # Wire the pipeline: transport in → STT → LLM → TTS → transport out
    pipeline = Pipeline([
        transport.input(),   # Microphone audio frames
        stt,                 # Audio → TranscriptionFrame
        llm,                 # TranscriptionFrame → TextFrames (tokens)
        tts,                 # TextFrames → TTSAudioRawFrame
        transport.output(),  # Audio frames → speaker
    ])

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )

    runner = PipelineRunner()
    await runner.run(task)

if __name__ == "__main__":
    asyncio.run(main())
```

That's a working voice agent in about 60 lines. The transport handles WebRTC negotiation, VAD detects when the user stops speaking, and frames flow through STT, LLM, and TTS concurrently.
But those 60 lines hide enormous complexity. Let's crack open each stage and see where latency actually lives.
Where does every millisecond go?
The 300ms budget breaks down into four slices: STT finalization, LLM time-to-first-token, TTS time-to-first-byte, and transport overhead. In practice, most teams blow the budget on the LLM slice because they pick the wrong model or don't stream properly. Here's the real breakdown.
The latency budget
| Stage | Target | What's happening |
|---|---|---|
| STT finalization | 50-100ms | Confirming the final transcript after speech ends |
| LLM first token | 100-200ms | Model processes prompt, generates first response token |
| TTS first byte | 50-80ms | Converting first token chunk to audio |
| Transport | 20-50ms (WebRTC) / 150-700ms (PSTN) | Network round trip |
| Total | 220-430ms (WebRTC) / 350-1080ms (phone) | End-to-end response time |
The numbers tell the story. On WebRTC, you have a comfortable margin. On a phone call through Twilio, you are already over budget before the LLM generates a single token.
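The same arithmetic works as a quick sanity check against your own measurements. A sketch, with illustrative stage numbers drawn from the target ranges above:

```python
# Budget check: perceived response time is the sum of each stage's
# first-byte latency plus transport overhead. The numbers passed below
# are illustrative, not measurements.
BUDGET_MS = 300

def response_time_ms(stt_ttfb, llm_ttft, tts_ttfb, transport):
    """Perceived response time: first-byte latencies plus transport RTT."""
    return stt_ttfb + llm_ttft + tts_ttfb + transport

webrtc = response_time_ms(70, 130, 60, 30)   # 290ms: within budget
pstn = response_time_ms(70, 130, 60, 400)    # 660ms: over budget on transport alone

print(webrtc <= BUDGET_MS, pstn <= BUDGET_MS)
```

Identical models in both cases; only the transport term changes, and it alone decides whether the budget is reachable.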
Measuring latency in Pipecat
Pipecat emits metrics frames that track timing across each stage. Drop this monitor into your pipeline to see exactly where your budget goes:
```python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import MetricsFrame
from pipecat.metrics.metrics import TTFBMetricsData

class LatencyMonitor(FrameProcessor):
    """Tracks and logs per-stage latency metrics."""

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, MetricsFrame):
            for metric in frame.data:
                if isinstance(metric, TTFBMetricsData):
                    # metric.value is in seconds, convert to ms
                    ms = metric.value * 1000
                    print(f"  [{metric.processor}] TTFB: {ms:.0f}ms")
        await self.push_frame(frame, direction)
```

Insert this monitor after TTS in the pipeline and you'll see output like:
```
[deepgram-stt] TTFB: 82ms
[openai-llm] TTFB: 187ms
[cartesia-tts] TTFB: 64ms
```

That's 333ms. Close to budget. The LLM is the bottleneck, as it almost always is.
Choosing models for latency
Model selection determines whether you hit the budget. Here's what we've measured in production:
| Provider | Model | Typical TTFT | Best for |
|---|---|---|---|
| OpenAI | gpt-4o-mini | 120-200ms | General voice agents |
| OpenAI | gpt-4o | 250-500ms | Complex reasoning (blows budget) |
| Anthropic | Claude Haiku 3.5 | 150-250ms | Tool-heavy agents |
| Groq | llama-3.3-70b | 50-100ms | Speed-critical, simpler tasks |
| Google | Gemini 2.0 Flash | 100-180ms | Multimodal inputs |
The fastest path to a responsive agent is gpt-4o-mini or Groq for the LLM, nova-3 for STT, and Cartesia sonic-2 for TTS. If your agent needs advanced tool calling, Claude Haiku handles complex MCP tool chains without destroying the latency budget.
For voice-specific prompt engineering, the key principle is brevity. Every extra sentence in the system prompt adds inference time. Voice prompts should be 3-5 sentences, not the page-long instructions you would use for a text agent.
How does turn detection actually work?
This is where voice agents feel human or feel robotic. Turn detection determines when the user has finished speaking and the agent should respond. Get it wrong and the agent either talks over the user (too aggressive) or leaves awkward silences (too conservative). VAD alone gets you 70% of the way. Semantic turn detection handles the other 30%.
VAD: the baseline
Voice Activity Detection analyzes the audio signal to detect speech vs silence. Silero VAD, which Pipecat uses by default, is a small neural network that classifies each audio frame as speech or non-speech. When it detects a silence gap longer than a threshold (typically 300-800ms), it marks the end of the user's turn.
The problem? Humans pause mid-sentence all the time. "I want to book a flight to..." (300ms pause while thinking) "...San Francisco." A 300ms VAD threshold would trigger the agent to respond after "to," cutting off the user.
```python
# VAD configuration in Pipecat's transport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url=room_url,
    token=token,
    bot_name="Agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(
            stop_secs=0.3,
            min_volume=0.1,
        )),
        vad_audio_passthrough=True,
    ),
)
```

VAD parameters you can tune:
| Parameter | Default | Effect |
|---|---|---|
| Stop speaking threshold | 0.3s | Silence duration before triggering end-of-turn |
| Start speaking threshold | 0.2 | Silero confidence score to detect speech onset |
| Min speech duration | 0.1s | Ignore sounds shorter than this (coughs, clicks) |
Raising the stop threshold to 500ms+ reduces false triggers but adds that delay to every response. This is the fundamental tradeoff of VAD alone.
Semantic turn detection: understanding intent, not just silence
Semantic turn detection solves the mid-sentence pause problem by analyzing the transcript content rather than the audio signal. When VAD detects a potential end-of-turn, a lightweight model evaluates whether the transcript represents a complete thought.
LiveKit Agents pioneered this with a transformer-based turn detector that runs in under 75ms P99. The model receives the partial transcript and outputs a confidence score for "speaker has finished their turn." If the score is low (mid-sentence pause), the system waits. If the score is high (complete question), the system triggers the response immediately.
Pipecat supports this through custom turn detection processors. Here's the conceptual pattern (simplified for clarity):
```python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)

class SmartTurnDetector(FrameProcessor):
    """Combines VAD silence detection with semantic completeness checking."""

    def __init__(self, llm_service, confidence_threshold: float = 0.7):
        super().__init__()
        self._llm = llm_service
        self._threshold = confidence_threshold
        self._current_transcript = ""
        self._speaking = False

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            self._speaking = True
            self._current_transcript = ""
            await self.push_frame(frame, direction)
            return

        if isinstance(frame, TranscriptionFrame):
            self._current_transcript += " " + frame.text

        if isinstance(frame, UserStoppedSpeakingFrame):
            self._speaking = False
            # VAD says they stopped. But did they finish their thought?
            is_complete = await self._check_completeness(
                self._current_transcript.strip()
            )
            if is_complete:
                # Genuine end of turn: pass the stop frame through
                await self.push_frame(frame, direction)
            else:
                # Mid-sentence pause: swallow the stop frame, keep listening
                pass
            return

        await self.push_frame(frame, direction)

    async def _check_completeness(self, transcript: str) -> bool:
        """Use a fast model to check if the transcript is a complete thought."""
        if len(transcript.split()) < 3:
            return False  # Too short to be a complete utterance
        # In production, use a fine-tuned classifier for sub-10ms inference.
        # This example uses the LLM for demonstration.
        response = await self._llm.generate(
            messages=[{
                "role": "user",
                "content": (
                    f"Is this a complete sentence or question? "
                    f"Answer only 'yes' or 'no': \"{transcript}\""
                ),
            }],
            max_tokens=3,
        )
        return "yes" in response.lower()
```

In practice, you wouldn't call a full LLM for this. LiveKit uses a dedicated 3M-parameter transformer trained specifically on conversational turn boundaries. The principle is the same: use the meaning of the words, not just the absence of sound.
The results are significant. In our testing, semantic turn detection reduced false interruptions by 45% compared to VAD alone, while adding less than 50ms to the processing pipeline.
How do you handle interruptions?
Picture this: the agent is mid-sentence explaining your account balance, and the user blurts out "no, the other account." The pipeline must cancel everything in flight within 100ms. Stop the TTS playback. Flush buffered audio. Cancel the LLM generation. Start processing the new input. This is the hardest concurrency problem in voice AI, and getting it wrong produces the worst user experience: talking over each other.
Pipecat handles this through interruption frames that propagate upstream through the pipeline.
The allow_interruptions=True parameter in PipelineParams enables this behavior. Without it, the pipeline buffers the user's speech until the agent finishes talking, which feels terrible for the caller.
Here's how to add custom interruption behavior, like logging what the agent was saying when it got cut off:
```python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import TextFrame, StartInterruptionFrame

class InterruptionTracker(FrameProcessor):
    """Tracks what the agent was saying when interrupted."""

    def __init__(self):
        super().__init__()
        self._current_response = []

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            self._current_response.append(frame.text)
        if isinstance(frame, StartInterruptionFrame):
            if self._current_response:
                partial = "".join(self._current_response)
                print(f"[INTERRUPTED] Agent was saying: '{partial}...'")
            self._current_response = []
        await self.push_frame(frame, direction)
```

What happens during an interruption
The timeline of an interruption in a well-tuned pipeline looks like this:
- T+0ms: User starts speaking. VAD detects speech onset.
- T+10ms: Transport emits `StartInterruptionFrame`.
- T+15ms: TTS receives the frame, stops playback, flushes audio buffer.
- T+20ms: LLM receives the frame, cancels the current generation.
- T+25ms: Pipeline is clear and processing new audio.
- T+50-100ms: First partial transcript from new user speech arrives.
Total interruption latency: 25ms to clear the pipeline. The user perceives no overlap because audio playback stops within one frame (20ms).
Why does transport choice matter so much?
WebRTC saves 150-700ms compared to PSTN phone calls. That's not a marginal optimization. It's the difference between a pipeline that hits the 300ms budget comfortably and one that can't possibly meet it. The transport layer is the one part of the latency budget you can't optimize with better models.
WebRTC vs PSTN
| Factor | WebRTC | PSTN (Twilio) |
|---|---|---|
| Network path | Browser to server (UDP) | Phone to carrier to Twilio to server |
| Typical RTT | 20-50ms | 150-400ms |
| Codec overhead | Opus (native) | G.711 to Opus transcoding |
| Jitter buffer | Adaptive, minimal | Fixed, adds 40-80ms |
| Total transport overhead | 30-60ms | 200-700ms |
With WebRTC, you have 240-270ms left for STT + LLM + TTS. Plenty of room. With a phone call, you might have 0-100ms left after transport eats the budget. This is why browser-based voice agents feel dramatically snappier than phone-based ones.
Pipecat supports both through its transport abstraction. The pipeline code stays the same. You swap the transport layer:
```python
# WebRTC transport (Daily)
from pipecat.transports.services.daily import DailyTransport, DailyParams

transport = DailyTransport(
    room_url="https://your-domain.daily.co/room",
    token="your-token",
    bot_name="Agent",
    params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
)
```

```python
# Phone transport (Twilio via WebSocket + serializer)
from pipecat.transports.websocket.fastapi import (
    FastAPIWebsocketTransport, FastAPIWebsocketParams,
)
from pipecat.serializers.twilio import TwilioFrameSerializer

transport = FastAPIWebsocketTransport(
    websocket=websocket,  # From your FastAPI /ws endpoint
    params=FastAPIWebsocketParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        serializer=TwilioFrameSerializer(),  # Handles Twilio's media format
    ),
)
```

The rest of the pipeline, STT through TTS, doesn't change. Swap the transport and everything else stays the same.
When phone transport is worth the latency cost
Not every use case can use WebRTC. Inbound customer service lines need to answer real phone calls. Outbound campaigns dial actual phone numbers. Healthcare reminder systems call patients who don't have smartphones. In these cases, you accept the PSTN overhead and optimize everything else aggressively:
- Use Groq for LLM (50-100ms TTFT instead of 200ms)
- Use Deepgram nova-3 with `smart_format=false` (saves 20-30ms)
- Use Cartesia sonic-2 with the lowest-latency voice preset
- Tune VAD to a shorter stop threshold (250ms instead of 300ms)
Even with all of these optimizations, phone-based agents will feel slightly slower than WebRTC ones. That's physics, not engineering.
Pipecat Flows: structured conversations
Raw LLM conversations are freeform by definition. The model generates whatever seems appropriate given the context. For many voice use cases, that's exactly wrong. An appointment booking agent needs to collect specific information in a specific order. A payment processing agent needs to follow a compliance-mandated script. Pipecat Flows solves this with a state machine model for structured conversations.
A Flow defines nodes (conversation states) and edges (transitions between them). Each node has its own system prompt, expected actions, and transition conditions:
```python
flow_config = {
    "initial_node": "greeting",
    "nodes": {
        "greeting": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Greet the caller and ask how you can help. "
                    "Keep it to one sentence."
                ),
            }],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "handle_intent",
                    "description": "Route based on user intent",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "intent": {
                                "type": "string",
                                "enum": [
                                    "book_appointment",
                                    "check_status",
                                    "speak_to_human",
                                ],
                            }
                        },
                        "required": ["intent"],
                    },
                },
                "handler": "handle_intent_transition",
            }],
        },
        "book_appointment": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Collect the following: preferred date, preferred time, "
                    "and reason for visit. Confirm each before moving on."
                ),
            }],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "confirm_booking",
                    "description": "All details collected, confirm the appointment",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "date": {"type": "string"},
                            "time": {"type": "string"},
                            "reason": {"type": "string"},
                        },
                        "required": ["date", "time", "reason"],
                    },
                },
                "handler": "create_appointment",
            }],
        },
        "confirmation": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Confirm the appointment details and ask if there's "
                    "anything else. If not, say goodbye."
                ),
            }],
        },
    },
}
```

This isn't just prompt switching. Each node controls which tools the agent can call, what information it collects, and which transitions are valid. The LLM can't skip ahead or go off-script because the available function calls change at each state.
For agents that use MCP-based tools, Flows integrates cleanly. You define which MCP tools are available at each node, so the agent can only access tools that are relevant to the current conversation state.
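To make the transition side concrete, here's a hypothetical sketch of the `handle_intent_transition` handler named in the config above. Pipecat Flows handler signatures differ by version; this shows only the routing logic, assuming the handler receives the arguments the LLM produced and returns the next node's name. The `check_status` and `human_handoff` nodes are hypothetical additions, not part of the config shown.

```python
# Hypothetical routing table. Only "book_appointment" maps to a node
# defined in the flow_config above; the other targets are assumed
# additional nodes for illustration.
NEXT_NODE = {
    "book_appointment": "book_appointment",
    "check_status": "check_status",    # hypothetical node
    "speak_to_human": "human_handoff", # hypothetical node
}

async def handle_intent_transition(args: dict) -> str:
    """Map the intent extracted by the LLM to the next conversation node."""
    # Unknown or missing intent: stay in the greeting node and re-prompt.
    return NEXT_NODE.get(args.get("intent"), "greeting")
```

Because the LLM's function schema constrains `intent` to the enum in the config, the fallback branch only fires on malformed tool calls.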
Pipecat vs LiveKit Agents vs VAPI
Now that you've seen how a pipeline works at the frame level, the natural question: should you actually build all this yourself? The voice AI framework space splits into three tiers: build-it-yourself frameworks, managed infrastructure with code, and fully managed APIs. The right choice depends on how much control you need vs how fast you need to ship.
Pipecat: full control, full responsibility
Pipecat is a Python framework. You write the pipeline, choose every provider, and handle deployment yourself. The framework provides the streaming architecture, frame routing, and concurrency primitives. You provide everything else.
Best for: Teams that need custom pipeline logic, unusual provider combinations, or fine-grained latency optimization. If you need to insert a custom audio processor between STT and LLM (for profanity filtering, language detection, or audio analysis), Pipecat lets you do it with a 20-line processor class.
Trade-off: You own deployment, scaling, and monitoring. Pipecat doesn't provide infrastructure.
LiveKit Agents: infrastructure included
LiveKit Agents is a higher-level framework built on the LiveKit WebRTC infrastructure. You write agent code in Python, and LiveKit handles the WebRTC transport, room management, and scaling. Their signature feature is a transformer-based semantic turn detector with sub-75ms P99 latency.
Best for: Teams that want WebRTC-quality transport without managing WebRTC infrastructure. LiveKit's turn detection is arguably the best available as of early 2026.
Trade-off: Tied to LiveKit's infrastructure and pricing. Less flexibility in transport choices. Phone support requires their SIP integration.
VAPI: no pipeline code
VAPI is a fully managed API. You configure an agent through REST endpoints or their dashboard, specifying the STT/LLM/TTS providers, system prompt, and tools. VAPI handles the pipeline, transport, scaling, and turn detection.
Best for: Teams that need a voice agent in production this week, not this quarter. Prototyping. Use cases where the standard pipeline is sufficient.
Trade-off: Limited customization. You can't insert custom processors, control frame routing, or implement non-standard turn detection. If you need something VAPI doesn't support, you're stuck.
Comparison matrix
| Feature | Pipecat | LiveKit Agents | VAPI |
|---|---|---|---|
| Pipeline control | Full (you write it) | Medium (hooks + callbacks) | None (config only) |
| Transport | Any (Daily, Twilio, custom) | LiveKit WebRTC + SIP | Managed (WebRTC + phone) |
| Turn detection | Configurable (VAD, custom) | Semantic transformer (best) | Managed (opaque) |
| Custom processors | Yes (FrameProcessor class) | Limited (event handlers) | No |
| Deployment | You manage | LiveKit Cloud or self-host | Fully managed |
| Time to first demo | Hours | Hours | Minutes |
| Time to production | Weeks | Days to weeks | Days |
| Pricing model | Provider costs only | LiveKit infra + providers | Per-minute + providers |
For the examples in this article, we use Pipecat because it exposes every layer. Once you understand how frames flow through a pipeline, you can work with any framework. The concepts transfer directly.
Monitoring voice agents in production
Here's the uncomfortable truth: a voice pipeline that works in development breaks in production in ways text agents never do. Network jitter causes audio gaps. STT accuracy drops on accented speech. TTS occasionally generates garbled audio on long sentences. The only defense is continuous monitoring across every stage.
The metrics that matter:
| Metric | Target | What it tells you |
|---|---|---|
| TTFR (Time to First Response) | < 300ms (WebRTC), < 800ms (phone) | End-to-end user experience |
| STT TTFB | < 100ms | Transcription engine health |
| LLM TTFT | < 200ms | Model inference speed |
| TTS TTFB | < 80ms | Speech synthesis latency |
| Interruption rate | < 15% of turns | Turn detection quality |
| STT word error rate | < 10% | Transcription accuracy |
| Call completion rate | > 85% | Agent solving problems, not frustrating users |
Each metric maps to a different failure mode. High TTFR with normal component latencies points to transport issues. High interruption rate means your turn detection is too aggressive. Low completion rate means the agent is failing at the task, regardless of latency.
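That failure-mode mapping is easy to automate. A minimal sketch: the thresholds come from the table above (WebRTC targets), while the metric names and the `call` record are hypothetical stand-ins for whatever your monitoring store emits.

```python
# Per-stage latency targets in ms, from the table above (WebRTC deployment).
TARGETS_MS = {
    "stt_ttfb": 100,
    "llm_ttft": 200,
    "tts_ttfb": 80,
    "ttfr": 300,
}

def breached_targets(metrics: dict) -> list:
    """Return the names of metrics that exceed their target."""
    return [name for name, limit in TARGETS_MS.items()
            if metrics.get(name, 0) > limit]

# One call's measurements: every component is healthy, but end-to-end
# TTFR is over budget — which points at the transport layer.
call = {"stt_ttfb": 82, "llm_ttft": 187, "tts_ttfb": 64, "ttfr": 333}
print(breached_targets(call))  # ['ttfr']
```

Wire a check like this into your alerting and the breached metric name tells you which layer to investigate first.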
Chanl's monitoring dashboard tracks these metrics across every call in real time. You can drill into individual calls to see the frame-by-frame latency breakdown, listen to the audio, and read the transcript. Scorecards evaluate each call against quality criteria automatically: did the agent collect all required information? Was the tone appropriate? Did it hallucinate any facts?
For pre-production testing, scenario simulation lets you run hundreds of synthetic conversations against your voice agent before exposing it to real users. You define personas (impatient caller, heavy accent, frequent interrupter) and the system runs them against your pipeline, measuring every metric listed above.
Building a voice pipeline is the first half. Knowing whether it actually works for real users is the second half. Systematic evaluation is just as important for voice agents as it is for text-based ones, and arguably harder because you're evaluating audio quality and conversational dynamics in addition to response correctness.
Putting it all together
Here's the complete pipeline with turn detection, interruption handling, latency monitoring, and structured conversation flow:
```python
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    MetricsFrame,
    TranscriptionFrame,
    StartInterruptionFrame,
)
from pipecat.metrics.metrics import TTFBMetricsData

class PipelineMonitor(FrameProcessor):
    """Logs latency metrics and conversation events."""

    def __init__(self):
        super().__init__()
        self._turn_count = 0

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            self._turn_count += 1
            print(f"[Turn {self._turn_count}] User: {frame.text}")
        if isinstance(frame, StartInterruptionFrame):
            print(f"[Turn {self._turn_count}] User interrupted agent")
        if isinstance(frame, MetricsFrame):
            for m in frame.data:
                if isinstance(m, TTFBMetricsData):
                    print(f"  {m.processor}: {m.value * 1000:.0f}ms")
        await self.push_frame(frame, direction)

async def main():
    transport = DailyTransport(
        room_url=os.getenv("DAILY_ROOM_URL"),
        token=os.getenv("DAILY_TOKEN"),
        bot_name="Voice Agent",
        params=DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(
                stop_secs=0.3,
                min_volume=0.1,
            )),
            vad_audio_passthrough=True,
        ),
    )

    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        settings=DeepgramSTTService.Settings(
            model="nova-3",
            language="en",
        ),
    )

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-mini",
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model="sonic-2",
    )

    monitor = PipelineMonitor()

    messages = [
        {
            "role": "system",
            "content": (
                "You are a voice assistant for a tech company. "
                "Be concise and conversational. "
                "Answer in 1-3 sentences unless the user asks for detail. "
                "Never say 'as an AI' or 'I don't have feelings.' "
                "Sound like a helpful coworker, not a robot."
            ),
        }
    ]

    pipeline = Pipeline([
        transport.input(),
        stt,
        monitor,  # Log transcripts and metrics
        llm,
        tts,
        transport.output(),
    ])

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )

    runner = PipelineRunner()
    await runner.run(task)

if __name__ == "__main__":
    asyncio.run(main())
```

What should you build next?
You now have a working voice pipeline, a clear understanding of the latency budget, and the tools to measure it. The natural next steps depend on your use case.
If you're building a customer-facing voice agent, add MCP tools so the agent can actually look up orders, check account status, and take actions. Our MCP tutorial walks through building an MCP server from scratch. You'll also want persistent memory so the agent remembers returning callers.
If you're evaluating voice quality before launch, run scenario simulations with diverse personas. An agent that works with your accent and speaking pace might fail with a caller who speaks twice as fast or pauses mid-sentence for 2 seconds.
If you're optimizing an existing pipeline, start measuring. Add the latency monitor from this article, run 100 test calls, and look at the P95 numbers rather than averages. The worst 5% of calls define your user experience more than the median does.
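A quick illustration of why P95 matters more than the mean, with synthetic TTFR samples standing in for your 100 test calls:

```python
import random
import statistics

# 95 healthy calls around 280ms, plus 5 slow outliers. Synthetic data
# for illustration only.
random.seed(7)
ttfr_ms = [random.gauss(280, 40) for _ in range(95)] + [900, 950, 1100, 1200, 1400]

mean = statistics.fmean(ttfr_ms)
p95 = statistics.quantiles(ttfr_ms, n=20)[-1]  # last of 19 cut points = P95

print(f"mean: {mean:.0f}ms  p95: {p95:.0f}ms")
```

The five slow calls barely move the mean, but the P95 lands squarely in "bad phone connection" territory. That tail is what your worst-served users experience on every turn.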
The voice AI pipeline isn't conceptually complex. Three stages, streaming frames, concurrent execution. The difficulty is in the details: turn detection that doesn't interrupt people, interruption handling that clears the pipeline in 20ms, transport choices that leave enough budget for actual AI processing, and monitoring that catches degradation before users complain.
Remember that 1.4-second response time we started with? The fix wasn't a faster model. It was switching from PSTN to WebRTC (saved 400ms), dropping to gpt-4o-mini (saved 350ms), and adding the latency monitor so we could see the remaining bottleneck in TTS warmup. Total: 280ms. Users stopped comparing it to a bad phone connection and started comparing it to a fast coworker.
Every millisecond has a home in the budget. The teams that build great voice agents are the ones that know exactly where each one goes.
Monitor every millisecond of your voice pipeline
Chanl tracks STT, LLM, and TTS latency per call, scores conversation quality automatically, and simulates hundreds of test calls before you go live.