Live Voice Models

Live voice models are an emerging category of streaming audio models that replace or compress the classic voice-agent pipeline of speech-to-text (STT), LLM, and text-to-speech (TTS). Gemini 3.1 Flash Live is the flagship example from the Nate Herk voice cluster [src-007]; OpenAI's GPT Realtime Translate, GPT Realtime Whisper, and GPT Realtime 2 extend the same category into API-native translation, streaming transcription, and action-taking voice agents.

Key points

  • Eliminate the transcription hop: audio enters and exits the model as audio, preserving prosody, sarcasm and stress
  • Enable natural interruption — the model stops talking as soon as the caller starts, without the 'game of chicken' silence typical of pipeline voice agents
  • Robust to noisy environments (traffic, horns, restaurants)
  • Typically support multimodal vision — screen share or webcam — enabling voice-driven computer use
  • Earlier weakness: tool-call latency could be exposed as dead air when the model could not narrate while a function executed
  • Newer realtime reasoning models address that weakness with preambles, parallel tool calls, larger context windows, and better state maintenance across turns [src-083].
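  • Put competitive pressure on platform abstractions like Vapi and ElevenLabs, because orchestration those platforms provide around the STT, LLM, and TTS stages can move inside the model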
  • OpenAI's GPT Realtime Translate demo adds live multilingual translation across 70 languages, including interruptions and technical terminology [src-051].
  • OpenAI's GPT Realtime 2 demo shows a realtime voice model communicating during reasoning and parallel tool calling so the user stays informed while actions execute [src-051].
  • OpenAI's Build Hour expands the model stack: GPT Realtime Whisper handles streaming transcription at roughly 200ms latency, GPT Realtime Translate covers more than 70 input languages and 13 output languages, and GPT Realtime 2 brings 128k context and GPT-5-class reasoning into voice agents [src-083].
  • Sierra's production discussion shows that live voice models still need a surrounding Production Voice Agent Harness for turn-taking, simulations, traces, redaction, PCI-safe flows, and policy-grounded task completion [src-083].
  • The AI Engineer corpus shows voice agents becoming an engineering domain: talks cover realtime voice AI, TTS data preparation, serving voice AI at low cost, turn-taking, interruption handling, voice-plus-vision, telemedicine support, and enterprise deployment timelines [src-077].
  • The production bottleneck is often the system around the model: latency budgets, phone or browser integration, audio quality, tool-call pacing, monitoring, cost per hour, and fallback handling [src-077].
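The preamble and parallel tool-call points above can be illustrated with a minimal sketch. It uses no vendor SDK: `speak`, `fake_tool`, `handle_turn`, and the tool names are hypothetical stand-ins, and the tool latencies are simulated with sleeps. The idea is only that the agent says something before the tools start, and that running the tools concurrently keeps the silent gap close to the slowest call rather than the sum of all calls.

```python
import asyncio
import time


async def speak(text, log):
    """Hypothetical stand-in for streaming audio out: records what was said and when."""
    log.append(("speech", text, time.monotonic()))


async def fake_tool(name, seconds):
    """Hypothetical tool call; the latency is simulated with a sleep."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"


async def handle_turn(tool_specs, log):
    # Preamble: tell the caller what is happening before any tool starts,
    # so the wait is not heard as dead air.
    await speak("One moment while I check that for you.", log)

    started = time.monotonic()
    # Parallel tool calls: the wait is roughly the slowest call, not the sum.
    results = await asyncio.gather(
        *(fake_tool(name, seconds) for name, seconds in tool_specs)
    )
    elapsed = time.monotonic() - started

    await speak("Here is what I found: " + "; ".join(results), log)
    return list(results), elapsed


def run_demo():
    log = []
    specs = [("lookup_order", 0.2), ("check_inventory", 0.3)]  # assumed latencies
    results, elapsed = asyncio.run(handle_turn(specs, log))
    return results, elapsed, log


if __name__ == "__main__":
    results, elapsed, log = run_demo()
    print(results, f"tool wait: {elapsed:.2f}s")
```

In a real deployment the measured wait would feed the latency budget and tool-call pacing mentioned in the last bullet; here it is only returned so the effect of concurrency is visible.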

Source references

  • [src-007] Nate Herk – Voice AI agents cluster (4 videos; video referenced: Qt3zMBH-FNg)
  • [src-051] OpenAI – "We’re introducing three audio models in the API" (2026-05-07)
  • [src-077] AI Engineer channel transcript cluster (678 saved transcripts, 2023-10-20 to 2026-05-15)
  • [src-083] OpenAI – "Build Hour: GPT-Realtime-2" (2026-05-13)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

The useful next step is to connect this concept back to applied product leadership and operating models.
