Live Voice Models

Live voice models are an emerging category of streaming audio models that replace or compress the classic STT-to-LLM-to-TTS voice-agent pipeline. Gemini 3.1 Flash Live is the flagship example from the Nate Herk voice cluster; OpenAI's GPT Realtime Translate, GPT Realtime Whisper, and GPT Realtime 2 extend the same category into API-native translation, streaming transcription, and action-taking voice agents.

Key points

Eliminate the transcription hop: audio enters and exits the model as audio, preserving prosody, sarcasm and stress
Enable natural interruption: the model stops talking as soon as the caller starts, without the 'game of chicken' silence typical of pipeline voice agents
Robust to noisy environments (traffic, horns, restaurants)
Typically support multimodal vision (screen share or webcam), enabling voice-driven computer use
Earlier weakness: tool-call latency could be exposed as dead air when the model could not narrate while a function executed
Newer realtime reasoning models address that weakness with preambles, parallel tool calls, larger context windows, and better state maintenance across turns ^[src-083].
Put competitive pressure on platform abstractions such as Vapi and ElevenLabs
OpenAI's GPT Realtime Translate demo adds live multilingual translation across 70 languages, including interruptions and technical terminology ^[src-051].
OpenAI's GPT Realtime 2 demo shows a realtime voice model communicating during reasoning and parallel tool calling so the user stays informed while actions execute ^[src-051].
OpenAI's Build Hour expands the model stack: GPT Realtime Whisper handles streaming transcription at roughly 200 ms latency, GPT Realtime Translate covers more than 70 input languages and 13 output languages, and GPT Realtime 2 brings a 128k context window and GPT-5-class reasoning into voice agents ^[src-083].
Sierra's production discussion shows that live voice models still need a surrounding Production Voice Agent Harness for turn-taking, simulations, traces, redaction, PCI-safe flows, and policy-grounded task completion ^[src-083].
The AI Engineer corpus shows voice agents becoming an engineering domain: talks cover realtime voice AI, TTS data preparation, serving voice AI at low cost, turn-taking, interruption handling, voice-plus-vision, telemedicine support, and enterprise deployment timelines ^[src-077].
The production bottleneck is often the system around the model: latency budgets, phone or browser integration, audio quality, tool-call pacing, monitoring, cost per hour, and fallback handling ^[src-077].
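
The preamble and parallel tool-call pattern described in the key points can be sketched in a few lines. This is a minimal illustration using Python's asyncio, not the API of any realtime model: the tool names (lookup_order, check_refund_policy), their delays, their return values, and the log format are all hypothetical. The agent records a spoken preamble first, then starts both tool calls concurrently so the caller is not left in dead air while they run.

```python
import asyncio


async def fake_tool(name, delay, result, log):
    """Hypothetical tool call; asyncio.sleep stands in for a slow backend."""
    log.append(f"tool_start:{name}")
    await asyncio.sleep(delay)
    log.append(f"tool_end:{name}")
    return result


async def handle_turn(log):
    """Speak a preamble, then run two tool calls concurrently."""
    # The preamble is emitted before any tool work begins, so the caller
    # hears something while the functions execute.
    log.append("say:preamble")

    # asyncio.gather starts both coroutines before either one finishes,
    # so the waits overlap instead of running back to back.
    order, policy = await asyncio.gather(
        fake_tool("lookup_order", 0.05, {"status": "shipped"}, log),
        fake_tool("check_refund_policy", 0.05, {"refund_window_days": 30}, log),
    )

    answer = (
        f"Your order is {order['status']}, and refunds are open "
        f"for {policy['refund_window_days']} days."
    )
    log.append("say:answer")
    return answer


def run_demo():
    """Run one simulated turn and return the event log and final answer."""
    log = []
    answer = asyncio.run(handle_turn(log))
    return log, answer


if __name__ == "__main__":
    events, reply = run_demo()
    for event in events:
        print(event)
    print(reply)
```

In the event log, both tool_start entries appear before either tool_end entry, which is the overlap that parallel tool calling provides. A production harness would stream real audio for the preamble and handle tool failures and timeouts, which this sketch leaves out.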

Source references

^[src-007] Nate Herk – Voice AI agents cluster (4 videos)

– Videos referenced: Qt3zMBH-FNg

^[src-051] OpenAI – "We’re introducing three audio models in the API" (2026-05-07)
^[src-077] AI Engineer channel transcript cluster (678 saved transcripts, 2023-10-20 to 2026-05-15)
^[src-083] OpenAI – "Build Hour: GPT-Realtime-2" (2026-05-13)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

The useful next step is to connect this concept back to applied product leadership and operating models.
