Chanl
Learning AI

Voice AI pipeline: STT, LLM, TTS and the 300ms budget

Build a real-time voice pipeline with Pipecat. How STT, LLM, and TTS stream concurrently under a 300ms latency budget, with turn detection and interruptions.

Dean Grover, Co-founder
April 1, 2026
22 min read

Our voice agent took 1.4 seconds to respond. Users described it as "talking to someone on a bad phone connection." We traced the problem through every layer: 280ms in speech-to-text, 600ms waiting for the LLM to start generating, 320ms before the first audio reached the speaker, and 200ms of transport overhead nobody had measured. The fix wasn't faster models. It was understanding how the pipeline actually works and where every millisecond goes.

This article builds a real-time voice AI pipeline from scratch. We'll start with a single audio frame and follow it through speech recognition, language model inference, and speech synthesis. You'll see exactly where latency hides, how streaming eliminates the biggest bottleneck, and how to get the whole thing under 300ms.

| What you'll build | What you'll learn |
| --- | --- |
| Streaming voice pipeline | How STT, LLM, and TTS run concurrently through frames |
| Latency budget breakdown | Where every millisecond goes and how to measure it |
| Smart turn detection | VAD vs semantic end-of-turn and why it matters |
| Interruption handler | Stopping mid-sentence when the user cuts in |
| Transport layer | Why WebRTC saves 700ms over phone calls |
| Framework comparison | Pipecat vs LiveKit vs VAPI for different use cases |

What you'll need

Runtime:

  • Python 3.11+ (Pipecat is Python-native)
  • API keys for Deepgram (STT), OpenAI (LLM), and Cartesia or ElevenLabs (TTS)

Install dependencies:

bash
pip install "pipecat-ai[deepgram,openai,cartesia,daily,silero]"

Set your keys:

bash
export DEEPGRAM_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export CARTESIA_API_KEY="your-key"

What does a voice AI pipeline actually do?

A voice AI pipeline converts spoken audio to text, reasons about a response, and converts that response back to audio, all in real time. The entire round trip needs to happen in under 300ms to feel like a natural conversation. The trick is that these three stages don't run one after the other. They stream concurrently.

Here's the mental model. A human conversation has a natural rhythm. One person speaks, there's a brief pause (typically 200-300ms), and the other person responds. If that pause stretches to 600ms, the conversation feels sluggish. Past a full second, it feels broken. Your AI agent is held to the same standard.

The pipeline has three core stages and a transport layer on each end:

[Diagram: Microphone → Transport (WebRTC / Twilio) → Speech-to-Text (Deepgram / Whisper, audio → text) → Language Model (GPT-4o / Claude, text → response tokens) → Text-to-Speech (Cartesia / ElevenLabs, tokens → audio) → Transport (WebRTC / Twilio) → Speaker]
Voice AI pipeline: audio frames flow through three concurrent stages

Without streaming, each stage waits for the previous one to finish completely. The user speaks for 3 seconds, STT processes the full utterance (280ms), the LLM generates the complete response (2-4 seconds), and TTS synthesizes all the audio (500ms). Total response time: 3+ seconds after the user stops talking. Nobody would use this.

With streaming, STT emits partial transcripts while the user is still speaking. The LLM starts generating tokens as soon as it has enough context. TTS converts each token chunk to audio the moment it arrives. The user hears the first word of the response within 300ms of finishing their sentence, while the rest of the response is still being generated.
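The arithmetic is worth spelling out. With illustrative stage latencies (assumed values, not benchmarks), streaming cuts time-to-first-audio by an order of magnitude:

```python
# Illustrative per-stage latencies in ms (assumptions, not measurements).
SEQUENTIAL = {
    "stt_full_utterance": 280,   # transcribe the entire utterance
    "llm_full_response": 3000,   # generate the complete response
    "tts_full_audio": 500,       # synthesize all the audio
}

STREAMING = {
    "stt_finalization": 80,      # confirm the final transcript after speech ends
    "llm_first_token": 180,      # time to first token
    "tts_first_byte": 60,        # first audio chunk reaches the speaker
}

def time_to_first_audio(stages: dict) -> int:
    """Latency until the user hears the first word after they stop speaking."""
    return sum(stages.values())

print(f"Sequential: {time_to_first_audio(SEQUENTIAL)}ms")  # 3780ms
print(f"Streaming:  {time_to_first_audio(STREAMING)}ms")   # 320ms
```

The key point: with streaming, the slow parts of each stage overlap with the user hearing earlier audio, so only the first-token and first-byte latencies land on the critical path.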

That's the architecture. Let's build it.

How does frame-based streaming work?

Frame-based streaming treats audio and text as a continuous river of small typed objects called frames. Each processor in the pipeline consumes one frame type, does its work, and emits the next frame type downstream. Audio frames carry 20ms of PCM samples. Transcription frames carry text. LLM frames carry tokens. This design lets every stage run concurrently on its own async task.
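The frames are tiny. At 16kHz, 16-bit mono PCM (the sample rate used in the examples below), the sizes work out as follows:

```python
# Size of one 20ms audio frame at 16kHz, 16-bit mono PCM.
SAMPLE_RATE = 16_000      # samples per second
SAMPLE_WIDTH = 2          # bytes per sample (16-bit)
FRAME_MS = 20             # frame duration in milliseconds

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples
bytes_per_frame = samples_per_frame * SAMPLE_WIDTH   # 640 bytes
frames_per_second = 1000 // FRAME_MS                 # 50 frames per second

print(samples_per_frame, bytes_per_frame, frames_per_second)  # 320 640 50
```

Fifty 640-byte frames per second is cheap to route through async queues, which is what makes per-frame concurrency practical.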

Pipecat's frame model is the foundation of everything else. If you have built data pipelines with tools like Apache Kafka or Unix pipes, the concept is familiar: small messages flowing through a chain of processors.

Here are the core frame types you'll work with:

python
from pipecat.frames.frames import (
    AudioRawFrame,          # 20ms of PCM audio samples
    TranscriptionFrame,     # Final transcript from STT
    InterimTranscriptionFrame,  # Partial transcript (still speaking)
    TextFrame,              # LLM output token
    TTSAudioRawFrame,       # Synthesized audio from TTS
    StartInterruptionFrame, # User started speaking (cancel current output)
    UserStartedSpeakingFrame,  # VAD detected speech onset
    UserStoppedSpeakingFrame,  # VAD detected silence after speech
    EndFrame,               # Pipeline shutdown signal
)

Each processor is a Python class that receives frames, processes them, and pushes new frames downstream. Here's a minimal custom processor that logs every transcript before passing it through:

python
from pipecat.frames.frames import TranscriptionFrame
from pipecat.processors.frame_processor import FrameProcessor, FrameDirection
 
class TranscriptLogger(FrameProcessor):
    """Logs every final transcript passing through the pipeline."""
 
    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
 
        if isinstance(frame, TranscriptionFrame):
            print(f"[USER] {frame.text}")
 
        # Always push the frame downstream so the pipeline continues
        await self.push_frame(frame, direction)

Frames also flow in two directions. Downstream frames (audio in, transcripts, LLM tokens, TTS audio) flow left to right through the pipeline. Upstream frames (interruptions, control signals) flow right to left, telling earlier stages to cancel their current work. This bidirectional flow is what makes interruption handling possible without complex callback spaghetti.
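A toy model makes the two directions concrete. This is not Pipecat's implementation, just an illustration of downstream delivery and upstream clearing:

```python
class Processor:
    """Toy processor: buffers downstream frames, clears on upstream interrupt."""

    def __init__(self, name: str):
        self.name = name
        self.buffer: list[str] = []

    def handle_downstream(self, frame: str):
        self.buffer.append(frame)

    def handle_upstream_interrupt(self):
        self.buffer.clear()   # drop any in-flight work

def push_downstream(pipeline: list[Processor], frame: str):
    for proc in pipeline:               # left to right: stt -> llm -> tts
        proc.handle_downstream(frame)

def push_upstream_interrupt(pipeline: list[Processor]):
    for proc in reversed(pipeline):     # right to left: tts -> llm -> stt
        proc.handle_upstream_interrupt()

pipeline = [Processor("stt"), Processor("llm"), Processor("tts")]
push_downstream(pipeline, "hello")
push_upstream_interrupt(pipeline)
print([p.buffer for p in pipeline])     # [[], [], []]
```

The real system runs each processor on its own async task, but the routing invariant is the same: content flows one way, cancellation flows the other.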

Building the pipeline in Pipecat

Now we assemble the full pipeline. Pipecat uses a Pipeline class that chains processors together. Each processor handles one responsibility: capturing audio, transcribing speech, running the LLM, synthesizing audio, and playing it back.

The pipeline definition is declarative. You list the processors in order and Pipecat wires the frame routing automatically:

python
import asyncio
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
 
async def main():
    # Transport: WebRTC via Daily
    transport = DailyTransport(
        room_url="https://your-domain.daily.co/your-room",
        token="your-daily-token",
        bot_name="Voice Agent",
        params=DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(
                stop_secs=0.3,          # Silence duration before end-of-turn
                min_volume=0.1,
            )),
            vad_audio_passthrough=True,
        ),
    )
 
    # Stage 1: Speech-to-Text
    stt = DeepgramSTTService(
        api_key="your-deepgram-key",
        sample_rate=16000,
        settings=DeepgramSTTService.Settings(
            model="nova-3",           # Deepgram's fastest model
            language="en",
        ),
    )
 
    # Stage 2: Language Model
    llm = OpenAILLMService(
        api_key="your-openai-key",
        model="gpt-4o-mini",           # Fast inference, good enough for voice
    )
 
    # Stage 3: Text-to-Speech
    tts = CartesiaTTSService(
        api_key="your-cartesia-key",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model="sonic-2",           # Sub-100ms first-byte latency
        sample_rate=16000,
    )
 
    # System prompt: defines the agent's personality and constraints
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful voice assistant for a software company. "
                "Keep responses concise - under 3 sentences for simple questions. "
                "Use natural conversational language, not formal writing. "
                "If you don't know something, say so directly."
            ),
        }
    ]
 
    # Wire the pipeline: transport in → STT → LLM → TTS → transport out
    pipeline = Pipeline([
        transport.input(),       # Microphone audio frames
        stt,                     # Audio → TranscriptionFrame
        llm,                     # TranscriptionFrame → TextFrames (tokens)
        tts,                     # TextFrames → TTSAudioRawFrame
        transport.output(),      # Audio frames → speaker
    ])
 
    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )
 
    runner = PipelineRunner()
    await runner.run(task)
 
if __name__ == "__main__":
    asyncio.run(main())

That's a working voice agent in about 60 lines. The transport handles WebRTC negotiation, VAD detects when the user stops speaking, and frames flow through STT, LLM, and TTS concurrently.

But those 60 lines hide enormous complexity. Let's crack open each stage and see where latency actually lives.

Where does every millisecond go?

The 300ms budget breaks down into four slices: STT finalization, LLM time-to-first-token, TTS time-to-first-byte, and transport overhead. In practice, most teams blow the budget on the LLM slice because they pick the wrong model or don't stream properly. Here's the real breakdown.

The latency budget

| Stage | Target | What's happening |
| --- | --- | --- |
| STT finalization | 50-100ms | Confirming the final transcript after speech ends |
| LLM first token | 100-200ms | Model processes prompt, generates first response token |
| TTS first byte | 50-80ms | Converting first token chunk to audio |
| Transport | 20-50ms (WebRTC) / 150-700ms (PSTN) | Network round trip |
| Total | 220-430ms (WebRTC) / 350-1080ms (phone) | End-to-end response time |

The numbers tell the story. On WebRTC, you have a comfortable margin. On a phone call through Twilio, you are already over budget before the LLM generates a single token.
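A small helper makes the budget actionable: compare measured TTFB numbers against per-stage targets. The targets below restate the table's upper bounds; the measurements are illustrative:

```python
# Per-stage latency targets in ms (upper bounds from the budget table).
BUDGET_MS = {"stt": 100, "llm": 200, "tts": 80, "transport": 50}

def over_budget(measured: dict[str, float]) -> dict[str, float]:
    """Return the stages exceeding their target, with the overage in ms."""
    return {
        stage: measured[stage] - target
        for stage, target in BUDGET_MS.items()
        if measured.get(stage, 0) > target
    }

measured = {"stt": 82, "llm": 310, "tts": 64, "transport": 40}
print(over_budget(measured))  # {'llm': 110}
```

Checking per-stage rather than only end-to-end tells you which provider or model to swap when the total creeps up.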

Measuring latency in Pipecat

Pipecat emits metrics frames that track timing across each stage. Drop this monitor into your pipeline to see exactly where your budget goes:

python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import MetricsFrame
from pipecat.metrics.metrics import TTFBMetricsData
 
class LatencyMonitor(FrameProcessor):
    """Tracks and logs per-stage latency metrics."""
 
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
 
        if isinstance(frame, MetricsFrame):
            for metric in frame.data:
                if isinstance(metric, TTFBMetricsData):
                    # metric.value is in seconds, convert to ms
                    ms = metric.value * 1000
                    print(f"  [{metric.processor}] TTFB: {ms:.0f}ms")
 
        await self.push_frame(frame, direction)

Insert this monitor after TTS in the pipeline and you'll see output like:

text
  [deepgram-stt] TTFB: 82ms
  [openai-llm] TTFB: 187ms
  [cartesia-tts] TTFB: 64ms

That's 333ms, just over the 300ms target. The LLM is the bottleneck, as it almost always is.

Choosing models for latency

Model selection determines whether you hit the budget. Here's what we've measured in production:

| Provider | Model | Typical TTFT | Best for |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | 120-200ms | General voice agents |
| OpenAI | gpt-4o | 250-500ms | Complex reasoning (blows the budget) |
| Anthropic | Claude Haiku 3.5 | 150-250ms | Tool-heavy agents |
| Groq | llama-3.3-70b | 50-100ms | Speed-critical, simpler tasks |
| Google | Gemini 2.0 Flash | 100-180ms | Multimodal inputs |

The fastest path to a responsive agent is gpt-4o-mini or Groq for the LLM, nova-3 for STT, and Cartesia sonic-2 for TTS. If your agent needs advanced tool calling, Claude Haiku handles complex MCP tool chains without destroying the latency budget.

For voice-specific prompt engineering, the key principle is brevity. Every extra sentence in the system prompt adds inference time. Voice prompts should be 3-5 sentences, not the page-long instructions you would use for a text agent.

How does turn detection actually work?

This is where voice agents feel human or feel robotic. Turn detection determines when the user has finished speaking and the agent should respond. Get it wrong and the agent either talks over the user (too aggressive) or leaves awkward silences (too conservative). VAD alone gets you 70% of the way. Semantic turn detection handles the other 30%.

VAD: the baseline

Voice Activity Detection analyzes the audio signal to detect speech vs silence. Silero VAD, which Pipecat uses by default, is a small neural network that classifies each audio frame as speech or non-speech. When it detects a silence gap longer than a threshold (typically 300-800ms), it marks the end of the user's turn.

The problem? Humans pause mid-sentence all the time. "I want to book a flight to..." (300ms pause while thinking) "...San Francisco." A 300ms VAD threshold would trigger the agent to respond after "to," cutting off the user.

python
# VAD configuration in Pipecat's transport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
 
transport = DailyTransport(
    room_url=room_url,
    token=token,
    bot_name="Agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(
            stop_secs=0.3,
            min_volume=0.1,
        )),
        vad_audio_passthrough=True,
    ),
)

VAD parameters you can tune:

| Parameter | Default | Effect |
| --- | --- | --- |
| Stop speaking threshold | 0.3s | Silence duration before triggering end-of-turn |
| Start speaking threshold | 0.2 | Silero confidence score to detect speech onset |
| Min speech duration | 0.1s | Ignore sounds shorter than this (coughs, clicks) |

Raising the stop threshold to 500ms+ reduces false triggers but adds that delay to every response. This is the fundamental tradeoff of VAD alone.

Semantic turn detection: understanding intent, not just silence

Semantic turn detection solves the mid-sentence pause problem by analyzing the transcript content rather than the audio signal. When VAD detects a potential end-of-turn, a lightweight model evaluates whether the transcript represents a complete thought.

LiveKit Agents pioneered this with a transformer-based turn detector that runs in under 75ms P99. The model receives the partial transcript and outputs a confidence score for "speaker has finished their turn." If the score is low (mid-sentence pause), the system waits. If the score is high (complete question), the system triggers the response immediately.

Pipecat supports this through custom turn detection processors. Here's the conceptual pattern (simplified for clarity):

python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)
 
class SmartTurnDetector(FrameProcessor):
    """Combines VAD silence detection with semantic completeness checking."""
 
    def __init__(self, llm_service, confidence_threshold: float = 0.7):
        super().__init__()
        self._llm = llm_service
        self._threshold = confidence_threshold
        self._current_transcript = ""
        self._speaking = False
 
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
 
        if isinstance(frame, UserStartedSpeakingFrame):
            self._speaking = True
            self._current_transcript = ""
            await self.push_frame(frame, direction)
            return
 
        if isinstance(frame, TranscriptionFrame):
            self._current_transcript += " " + frame.text
 
        if isinstance(frame, UserStoppedSpeakingFrame):
            self._speaking = False
 
            # VAD says they stopped. But did they finish their thought?
            is_complete = await self._check_completeness(
                self._current_transcript.strip()
            )
 
            if is_complete:
                # Genuine end of turn: pass the stop frame through
                await self.push_frame(frame, direction)
            else:
                # Mid-sentence pause: swallow the stop frame, keep listening
                pass
            return
 
        await self.push_frame(frame, direction)
 
    async def _check_completeness(self, transcript: str) -> bool:
        """Use a fast model to check if the transcript is a complete thought."""
        if len(transcript.split()) < 3:
            return False  # Too short to be a complete utterance
 
        # In production, use a fine-tuned classifier for sub-10ms inference.
        # This example uses the LLM for demonstration.
        response = await self._llm.generate(
            messages=[{
                "role": "user",
                "content": (
                    f"Is this a complete sentence or question? "
                    f"Answer only 'yes' or 'no': \"{transcript}\""
                ),
            }],
            max_tokens=3,
        )
        return "yes" in response.lower()

In practice, you wouldn't call a full LLM for this. LiveKit uses a dedicated 3M-parameter transformer trained specifically on conversational turn boundaries. The principle is the same: use the meaning of the words, not just the absence of sound.

The results are significant. In our testing, semantic turn detection reduced false interruptions by 45% compared to VAD alone, while adding less than 50ms to the processing pipeline.
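Before reaching for a model at all, a cheap heuristic catches many of the obvious cases. This is an illustrative fallback, not anything from LiveKit or Pipecat; the word list and length threshold are assumptions:

```python
# Function words that rarely end a complete utterance (illustrative list).
TRAILING_INCOMPLETE = {
    "to", "the", "a", "an", "and", "or", "but", "of", "for", "with",
    "in", "on", "at", "um", "uh", "so", "like",
}

def looks_complete(transcript: str) -> bool:
    """Sub-millisecond check: short utterances and dangling function words
    are treated as incomplete; everything else passes."""
    words = transcript.strip().lower().rstrip(".?!").split()
    if len(words) < 3:
        return False
    return words[-1] not in TRAILING_INCOMPLETE

print(looks_complete("I want to book a flight to"))      # False
print(looks_complete("what time do you open tomorrow"))  # True
```

A reasonable layering is: heuristic first, and only invoke the classifier for transcripts the heuristic can't rule out.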

How do you handle interruptions?

Picture this: the agent is mid-sentence explaining your account balance, and the user blurts out "no, the other account." The pipeline must cancel everything in flight within 100ms. Stop the TTS playback. Flush buffered audio. Cancel the LLM generation. Start processing the new input. This is the hardest concurrency problem in voice AI, and getting it wrong produces the worst user experience: talking over each other.

Pipecat handles this through interruption frames that propagate upstream through the pipeline:

[Diagram: User starts talking (VAD triggers) → Transport emits StartInterruptionFrame → TTS stops playback and flushes its buffer → LLM cancels generation → STT is ready for new input → new audio frames are processed; pipeline cleared in <100ms]
Interruption frame propagates upstream, clearing buffers at each stage

The allow_interruptions=True parameter in PipelineParams enables this behavior. Without it, the pipeline buffers the user's speech until the agent finishes talking, which feels terrible for the caller.

Here's how to add custom interruption behavior, like logging what the agent was saying when it got cut off:

python
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import TextFrame, StartInterruptionFrame

class InterruptionTracker(FrameProcessor):
    """Tracks what the agent was saying when interrupted."""
 
    def __init__(self):
        super().__init__()
        self._current_response = []
 
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
 
        if isinstance(frame, TextFrame):
            self._current_response.append(frame.text)
 
        if isinstance(frame, StartInterruptionFrame):
            if self._current_response:
                partial = "".join(self._current_response)
                print(f"[INTERRUPTED] Agent was saying: '{partial}...'")
                self._current_response = []
 
        await self.push_frame(frame, direction)

What happens during an interruption

The timeline of an interruption in a well-tuned pipeline looks like this:

  1. T+0ms: User starts speaking. VAD detects speech onset.
  2. T+10ms: Transport emits StartInterruptionFrame.
  3. T+15ms: TTS receives the frame, stops playback, flushes audio buffer.
  4. T+20ms: LLM receives the frame, cancels the current generation.
  5. T+25ms: Pipeline is clear and processing new audio.
  6. T+50-100ms: First partial transcript from new user speech arrives.

Total interruption latency: 25ms to clear the pipeline. The user perceives no overlap because audio playback stops within one frame (20ms).
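Under the hood this maps onto asyncio task cancellation. Here's a sketch of the primitive, not Pipecat's actual internals:

```python
import asyncio

async def generate_response(tokens: list[str], out: list[str]):
    """Stand-in for streaming LLM + TTS output."""
    try:
        for tok in tokens:
            await asyncio.sleep(0.01)   # stand-in for per-token latency
            out.append(tok)
    except asyncio.CancelledError:
        out.append("<cancelled>")       # flush buffers, close streams here
        raise                           # always re-raise after cleanup

async def main() -> list[str]:
    spoken: list[str] = []
    task = asyncio.create_task(
        generate_response(["Your", "balance", "is", "$42"], spoken)
    )
    await asyncio.sleep(0.015)          # user interrupts mid-response
    task.cancel()                       # the StartInterruptionFrame moment
    try:
        await task
    except asyncio.CancelledError:
        pass
    return spoken

print(asyncio.run(main()))              # e.g. ['Your', '<cancelled>']
```

The important detail is the `except CancelledError` block: cleanup (flushing audio buffers, closing provider streams) runs at the cancellation point, then the exception is re-raised so the runner knows the task ended by cancellation rather than completion.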

Why does transport choice matter so much?

WebRTC saves 150-700ms compared to PSTN phone calls. That's not a marginal optimization. It's the difference between a pipeline that hits the 300ms budget comfortably and one that can't possibly meet it. The transport layer is the one part of the latency budget you can't optimize with better models.

WebRTC vs PSTN

| Factor | WebRTC | PSTN (Twilio) |
| --- | --- | --- |
| Network path | Browser to server (UDP) | Phone to carrier to Twilio to server |
| Typical RTT | 20-50ms | 150-400ms |
| Codec overhead | Opus (native) | G.711 to Opus transcoding |
| Jitter buffer | Adaptive, minimal | Fixed, adds 40-80ms |
| Total transport overhead | 30-60ms | 200-700ms |

With WebRTC, you have 240-270ms left for STT + LLM + TTS. Plenty of room. With a phone call, you might have 0-100ms left after transport eats the budget. This is why browser-based voice agents feel dramatically snappier than phone-based ones.

Pipecat supports both through its transport abstraction. The pipeline code stays the same. You swap the transport layer:

python
# WebRTC transport (Daily)
from pipecat.transports.services.daily import DailyTransport, DailyParams
 
transport = DailyTransport(
    room_url="https://your-domain.daily.co/room",
    token="your-token",
    bot_name="Agent",
    params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
)
 
# Phone transport (Twilio via WebSocket + serializer)
from pipecat.transports.websocket.fastapi import (
    FastAPIWebsocketTransport, FastAPIWebsocketParams,
)
from pipecat.serializers.twilio import TwilioFrameSerializer
 
transport = FastAPIWebsocketTransport(
    websocket=websocket,  # From your FastAPI /ws endpoint
    params=FastAPIWebsocketParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        serializer=TwilioFrameSerializer(),  # Handles Twilio's media format
    ),
)

The rest of the pipeline, STT through TTS, doesn't change between transports.

When phone transport is worth the latency cost

Not every use case can use WebRTC. Inbound customer service lines need to answer real phone calls. Outbound campaigns dial actual phone numbers. Healthcare reminder systems call patients who don't have smartphones. In these cases, you accept the PSTN overhead and optimize everything else aggressively:

  • Use Groq for LLM (50-100ms TTFT instead of 200ms)
  • Use Deepgram nova-3 with smart_format=false (saves 20-30ms)
  • Use Cartesia sonic-2 with the lowest-latency voice preset
  • Tune VAD to a shorter stop threshold (250ms instead of 300ms)

Even with all of these optimizations, phone-based agents will feel slightly slower than WebRTC ones. That's physics, not engineering.

Pipecat Flows: structured conversations

Raw LLM conversations are freeform by definition. The model generates whatever seems appropriate given the context. For many voice use cases, that's exactly wrong. An appointment booking agent needs to collect specific information in a specific order. A payment processing agent needs to follow a compliance-mandated script. Pipecat Flows solves this with a state machine model for structured conversations.

A Flow defines nodes (conversation states) and edges (transitions between them). Each node has its own system prompt, expected actions, and transition conditions:

python
flow_config = {
    "initial_node": "greeting",
    "nodes": {
        "greeting": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Greet the caller and ask how you can help. "
                    "Keep it to one sentence."
                ),
            }],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "handle_intent",
                    "description": "Route based on user intent",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "intent": {
                                "type": "string",
                                "enum": [
                                    "book_appointment",
                                    "check_status",
                                    "speak_to_human",
                                ],
                            }
                        },
                        "required": ["intent"],
                    },
                },
                "handler": "handle_intent_transition",
            }],
        },
        "book_appointment": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Collect the following: preferred date, preferred time, "
                    "and reason for visit. Confirm each before moving on."
                ),
            }],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "confirm_booking",
                    "description": "All details collected, confirm the appointment",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "date": {"type": "string"},
                            "time": {"type": "string"},
                            "reason": {"type": "string"},
                        },
                        "required": ["date", "time", "reason"],
                    },
                },
                "handler": "create_appointment",
            }],
        },
        "confirmation": {
            "role_messages": [{
                "role": "system",
                "content": (
                    "Confirm the appointment details and ask if there's "
                    "anything else. If not, say goodbye."
                ),
            }],
        },
    },
}

This isn't just prompt switching. Each node controls which tools the agent can call, what information it collects, and which transitions are valid. The LLM can't skip ahead or go off-script because the available function calls change at each state.

For agents that use MCP-based tools, Flows integrates cleanly. You define which MCP tools are available at each node, so the agent can only access tools that are relevant to the current conversation state.
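The enforcement mechanism can be sketched as a plain state machine. The node and function names mirror the flow_config above, but the class itself is illustrative, not the Pipecat Flows API:

```python
# Which functions each node exposes, and where each function transitions to.
FLOW = {
    "greeting": {
        "functions": {"handle_intent"},
        "transitions": {"handle_intent": "book_appointment"},
    },
    "book_appointment": {
        "functions": {"confirm_booking"},
        "transitions": {"confirm_booking": "confirmation"},
    },
    "confirmation": {"functions": set(), "transitions": {}},
}

class FlowState:
    """Tracks the current node; only that node's functions are callable."""

    def __init__(self, initial: str = "greeting"):
        self.node = initial

    def available_functions(self) -> set[str]:
        return FLOW[self.node]["functions"]

    def call(self, fn: str) -> str:
        if fn not in self.available_functions():
            raise ValueError(f"{fn} not available in node {self.node}")
        self.node = FLOW[self.node]["transitions"][fn]
        return self.node

state = FlowState()
state.call("handle_intent")     # greeting -> book_appointment
state.call("confirm_booking")   # -> confirmation
print(state.node)               # confirmation
```

Because the LLM only ever sees `available_functions()` for the current node, going off-script isn't a prompt-compliance problem, it's structurally impossible.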

Pipecat vs LiveKit Agents vs VAPI

Now that you've seen how a pipeline works at the frame level, the natural question: should you actually build all this yourself? The voice AI framework space splits into three tiers: build-it-yourself frameworks, managed infrastructure with code, and fully managed APIs. The right choice depends on how much control you need vs how fast you need to ship.

Pipecat: full control, full responsibility

Pipecat is a Python framework. You write the pipeline, choose every provider, and handle deployment yourself. The framework provides the streaming architecture, frame routing, and concurrency primitives. You provide everything else.

Best for: Teams that need custom pipeline logic, unusual provider combinations, or fine-grained latency optimization. If you need to insert a custom audio processor between STT and LLM (for profanity filtering, language detection, or audio analysis), Pipecat lets you do it with a 20-line processor class.

Trade-off: You own deployment, scaling, and monitoring. Pipecat doesn't provide infrastructure.

LiveKit Agents: infrastructure included

LiveKit Agents is a higher-level framework built on the LiveKit WebRTC infrastructure. You write agent code in Python, and LiveKit handles the WebRTC transport, room management, and scaling. Their signature feature is a transformer-based semantic turn detector with sub-75ms P99 latency.

Best for: Teams that want WebRTC-quality transport without managing WebRTC infrastructure. LiveKit's turn detection is arguably the best available as of early 2026.

Trade-off: Tied to LiveKit's infrastructure and pricing. Less flexibility in transport choices. Phone support requires their SIP integration.

VAPI: no pipeline code

VAPI is a fully managed API. You configure an agent through REST endpoints or their dashboard, specifying the STT/LLM/TTS providers, system prompt, and tools. VAPI handles the pipeline, transport, scaling, and turn detection.

Best for: Teams that need a voice agent in production this week, not this quarter. Prototyping. Use cases where the standard pipeline is sufficient.

Trade-off: Limited customization. You can't insert custom processors, control frame routing, or implement non-standard turn detection. If you need something VAPI doesn't support, you're stuck.

Comparison matrix

| Feature | Pipecat | LiveKit Agents | VAPI |
| --- | --- | --- | --- |
| Pipeline control | Full (you write it) | Medium (hooks + callbacks) | None (config only) |
| Transport | Any (Daily, Twilio, custom) | LiveKit WebRTC + SIP | Managed (WebRTC + phone) |
| Turn detection | Configurable (VAD, custom) | Semantic transformer (best) | Managed (opaque) |
| Custom processors | Yes (FrameProcessor class) | Limited (event handlers) | No |
| Deployment | You manage | LiveKit Cloud or self-host | Fully managed |
| Time to first demo | Hours | Hours | Minutes |
| Time to production | Weeks | Days to weeks | Days |
| Pricing model | Provider costs only | LiveKit infra + providers | Per-minute + providers |

For the examples in this article, we use Pipecat because it exposes every layer. Once you understand how frames flow through a pipeline, you can work with any framework. The concepts transfer directly.

Monitoring voice agents in production

Here's the uncomfortable truth: a voice pipeline that works in development breaks in production in ways text agents never do. Network jitter causes audio gaps. STT accuracy drops on accented speech. TTS occasionally generates garbled audio on long sentences. The only defense is continuous monitoring across every stage.

The metrics that matter:

| Metric | Target | What it tells you |
| --- | --- | --- |
| TTFR (Time to First Response) | < 300ms (WebRTC), < 800ms (phone) | End-to-end user experience |
| STT TTFB | < 100ms | Transcription engine health |
| LLM TTFT | < 200ms | Model inference speed |
| TTS TTFB | < 80ms | Speech synthesis latency |
| Interruption rate | < 15% of turns | Turn detection quality |
| STT word error rate | < 10% | Transcription accuracy |
| Call completion rate | > 85% | Agent solving problems, not frustrating users |

Each metric maps to a different failure mode. High TTFR with normal component latencies points to transport issues. High interruption rate means your turn detection is too aggressive. Low completion rate means the agent is failing at the task, regardless of latency.
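Averages hide the tail, so aggregate these as percentiles rather than means. A sketch using Python's stdlib, with made-up TTFR samples:

```python
import statistics

# Per-turn TTFR samples in ms (illustrative data with one slow outlier).
ttfr_samples = [240, 260, 255, 250, 270, 265, 950, 245, 258, 262]

def percentile(samples: list[float], p: int) -> float:
    """p-th percentile via stdlib quantiles (inclusive = linear interpolation)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]

print(f"mean: {statistics.mean(ttfr_samples):.0f}ms")  # skewed by the outlier
print(f"p50:  {percentile(ttfr_samples, 50):.0f}ms")
print(f"p95:  {percentile(ttfr_samples, 95):.0f}ms")
```

Here the median sits comfortably under budget while p95 is dominated by the single 950ms call, which is exactly the kind of call worth drilling into.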

Chanl's monitoring dashboard tracks these metrics across every call in real time. You can drill into individual calls to see the frame-by-frame latency breakdown, listen to the audio, and read the transcript. Scorecards evaluate each call against quality criteria automatically: did the agent collect all required information? Was the tone appropriate? Did it hallucinate any facts?

For pre-production testing, scenario simulation lets you run hundreds of synthetic conversations against your voice agent before exposing it to real users. You define personas (impatient caller, heavy accent, frequent interrupter) and the system runs them against your pipeline, measuring every metric listed above.

Building a voice pipeline is the first half. Knowing whether it actually works for real users is the second half. Systematic evaluation is just as important for voice agents as it is for text-based ones, and arguably harder because you're evaluating audio quality and conversational dynamics in addition to response correctness.

Putting it all together

Here's the complete pipeline with turn detection, interruption handling, latency monitoring, and structured conversation flow:

python
import asyncio
import os
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    MetricsFrame,
    TranscriptionFrame,
    StartInterruptionFrame,
)
from pipecat.metrics.metrics import TTFBMetricsData
 
class PipelineMonitor(FrameProcessor):
    """Logs latency metrics and conversation events."""
 
    def __init__(self):
        super().__init__()
        self._turn_count = 0
 
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
 
        if isinstance(frame, TranscriptionFrame):
            self._turn_count += 1
            print(f"[Turn {self._turn_count}] User: {frame.text}")
 
        if isinstance(frame, StartInterruptionFrame):
            print(f"[Turn {self._turn_count}] User interrupted agent")
 
        if isinstance(frame, MetricsFrame):
            for m in frame.data:
                if isinstance(m, TTFBMetricsData):
                    print(f"  {m.processor}: {m.value * 1000:.0f}ms")
 
        await self.push_frame(frame, direction)
 
async def main():
    transport = DailyTransport(
        room_url=os.getenv("DAILY_ROOM_URL"),
        token=os.getenv("DAILY_TOKEN"),
        bot_name="Voice Agent",
        params=DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(
                stop_secs=0.3,
                min_volume=0.1,
            )),
            vad_audio_passthrough=True,
        ),
    )
 
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        settings=DeepgramSTTService.Settings(
            model="nova-3",
            language="en",
        ),
    )
 
    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-mini",
    )
 
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model="sonic-2",
    )
 
    monitor = PipelineMonitor()
 
    messages = [
        {
            "role": "system",
            "content": (
                "You are a voice assistant for a tech company. "
                "Be concise and conversational. "
                "Answer in 1-3 sentences unless the user asks for detail. "
                "Never say 'as an AI' or 'I don't have feelings.' "
                "Sound like a helpful coworker, not a robot."
            ),
        }
    ]
 
    pipeline = Pipeline([
        transport.input(),
        stt,
        monitor,        # Log transcripts and metrics
        llm,
        tts,
        transport.output(),
    ])
 
    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )
 
    runner = PipelineRunner()
    await runner.run(task)
 
if __name__ == "__main__":
    asyncio.run(main())

What should you build next?

You now have a working voice pipeline, a clear understanding of the latency budget, and the tools to measure it. The natural next steps depend on your use case.

If you're building a customer-facing voice agent, add MCP tools so the agent can actually look up orders, check account status, and take actions. Our MCP tutorial walks through building an MCP server from scratch. You'll also want persistent memory so the agent remembers returning callers.

If you're evaluating voice quality before launch, run scenario simulations with diverse personas. An agent that works with your accent and speaking pace might fail with a caller who speaks twice as fast or pauses mid-sentence for 2 seconds.

If you're optimizing an existing pipeline, start measuring. Add the latency monitor from this article, run 100 test calls, and look at the P95 numbers rather than averages. The worst 5% of calls define your user experience more than the median does.
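A quick way to see why P95 matters more than the mean: with the standard library alone you can compare the two over a batch of TTFR samples. The sample data below is synthetic, simulating mostly-fast calls with a slow tail.

```python
import random
import statistics

# Simulated TTFR samples (ms): 95 fast calls plus a slow tail of 5.
random.seed(7)
samples = [random.gauss(250, 30) for _ in range(95)] + [900, 1100, 1200, 1300, 1400]

mean = statistics.mean(samples)
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95 = statistics.quantiles(samples, n=20)[18]

print(f"mean: {mean:.0f}ms, p95: {p95:.0f}ms")
```

The mean lands near 300ms and looks fine; the P95 lands near 900ms and shows what the unluckiest callers actually experience. Optimize against the tail.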

The voice AI pipeline isn't conceptually complex. Three stages, streaming frames, concurrent execution. The difficulty is in the details: turn detection that doesn't interrupt people, interruption handling that clears the pipeline in 20ms, transport choices that leave enough budget for actual AI processing, and monitoring that catches degradation before users complain.

Remember that 1.4-second response time we started with? The fix wasn't a faster model. It was switching from PSTN to WebRTC (saved 400ms), dropping to gpt-4o-mini (saved 350ms), and adding the latency monitor so we could see the remaining bottleneck in TTS warmup. Total: 280ms. Users stopped comparing it to a bad phone connection and started comparing it to a fast coworker.

Every millisecond has a home in the budget. The teams that build great voice agents are the ones that know exactly where each one goes.

Monitor every millisecond of your voice pipeline

Chanl tracks STT, LLM, and TTS latency per call, scores conversation quality automatically, and simulates hundreds of test calls before you go live.

Start building free
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.


Frequently Asked Questions