Live Voice Models
An emerging category of streaming audio models that replace or compress the classic voice-agent pipeline of speech-to-text (STT), then an LLM, then text-to-speech (TTS). Gemini 3.1 Flash Live is the flagship example from the Nate Herk voice cluster; OpenAI's GPT Realtime Translate, GPT Realtime Whisper, and GPT Realtime 2 extend the same category into API-native translation, streaming transcription, and action-taking voice agents.
Key points
- Eliminate the transcription hop: audio enters and exits the model as audio, preserving prosody, sarcasm, and stress
- Enable natural interruption: the model stops talking as soon as the caller starts speaking, without the 'game of chicken' silence typical of pipeline voice agents
- Stay robust in noisy environments (traffic, horns, restaurants)
- Typically support multimodal vision (screen share or webcam), enabling voice-driven computer use
- Earlier weakness: tool-call latency could be exposed as dead air when the model could not narrate while a function executed
- Newer realtime reasoning models address that weakness with preambles, parallel tool calls, larger context windows, and better state maintenance across turns [src-083].
- Put competitive pressure on platform abstractions like Vapi and ElevenLabs
- OpenAI's GPT Realtime Translate demo adds live multilingual translation across 70 languages, including interruptions and technical terminology [src-051].
- OpenAI's GPT Realtime 2 demo shows a realtime voice model communicating during reasoning and parallel tool calling so the user stays informed while actions execute [src-051].
- OpenAI's Build Hour expands the model stack: GPT Realtime Whisper handles streaming transcription at roughly 200ms latency, GPT Realtime Translate covers more than 70 input languages and 13 output languages, and GPT Realtime 2 brings 128k context and GPT-5-class reasoning into voice agents [src-083].
- Sierra's production discussion shows that live voice models still need a surrounding Production Voice Agent Harness for turn-taking, simulations, traces, redaction, PCI-safe flows, and policy-grounded task completion [src-083].
- The AI Engineer corpus shows voice agents becoming an engineering domain: talks cover realtime voice AI, TTS data preparation, serving voice AI at low cost, turn-taking, interruption handling, voice-plus-vision, telemedicine support, and enterprise deployment timelines [src-077].
- The production bottleneck is often the system around the model: latency budgets, phone or browser integration, audio quality, tool-call pacing, monitoring, cost per hour, and fallback handling [src-077].
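The preamble and parallel tool-call behavior described in the key points can be illustrated with a small sketch. This is only an illustration under stated assumptions, not any vendor's API: `lookup_order`, `lookup_refund_policy`, `handle_turn`, and the `speak` callback are hypothetical names, and the tools are simulated with short sleeps. It shows the general idea of starting the tool calls concurrently, speaking a preamble so the caller does not hear dead air, and then answering once the results arrive.

```python
import asyncio


async def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: stands in for a slow backend call."""
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


async def lookup_refund_policy(region: str) -> dict:
    """Hypothetical tool: a second, independent backend call."""
    await asyncio.sleep(0.05)
    return {"region": region, "window_days": 30}


async def handle_turn(speak, order_id: str, region: str) -> None:
    # Schedule both tool calls first so they run concurrently.
    tools = asyncio.gather(
        lookup_order(order_id),
        lookup_refund_policy(region),
    )
    # Preamble: tell the caller what is happening while the tools run.
    await speak("One moment while I check your order and the refund policy.")
    # Wait for both results, then give the grounded answer.
    order, policy = await tools
    await speak(
        f"Order {order['order_id']} is {order['status']}; "
        f"refunds are open for {policy['window_days']} days."
    )


async def run_demo() -> list[str]:
    transcript: list[str] = []

    async def speak(text: str) -> None:
        # Stand-in for streaming audio out; here we only record the text.
        transcript.append(text)

    await handle_turn(speak, "A123", "EU")
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(run_demo()):
        print(line)
```

In a real deployment the model itself would typically generate the preamble and decide which tools to call; the sketch only shows the ordering that keeps the caller informed while actions execute.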
Related entities
- Gemini 3.1 Flash Live
- GPT Realtime Translate
- GPT Realtime Whisper
- GPT Realtime 2
- Sierra
- OpenAI
- Vapi
- ElevenLabs
Related concepts
- Voice Agents
- Voice Agent Preambles
- Voice-to-Action Interfaces
- Production Voice Agent Harness
- AI Engineering Discipline
- LLM Inference Economics
- Agent Security Boundaries
Source references
- [src-007] Nate Herk – Voice AI agents cluster (4 videos); videos referenced: Qt3zMBH-FNg
- [src-051] OpenAI – "We’re introducing three audio models in the API" (2026-05-07)
- [src-077] AI Engineer channel transcript cluster (678 saved transcripts, 2023-10-20 to 2026-05-15)
- [src-083] OpenAI – "Build Hour: GPT-Realtime-2" (2026-05-13)