Google’s streaming speech-to-speech voice model, which replaces the classic speech-to-text (STT) to LLM to text-to-speech (TTS) pipeline with a single native speech model. Google markets it as its biggest voice upgrade, citing lower latency, noise-robust listening, interruption handling, and multimodal vision input.
Key facts
- Native speech-to-speech: there is no intermediate text transcription step, so prosody, sarcasm and stress carry through to the reasoning layer
- Beats Gemini 2.5 Flash by ~19% on multi-step function calling and outperforms competitor models on the Audio Multi-Challenge benchmark
- Supports multimodal vision: the agent can watch a webcam or screen-share feed and reason about what it sees
- Over 70 supported languages, enabling real-time translation use cases
- Free tier available in Google AI Studio with no API key; paid tier removes the ‘training on your data’ clause and raises rate limits
- Pricing: roughly 14 cents per 10-minute call on the paid tier
- Current limitation: the model stops speaking during function calls, so it cannot narrate over tool execution the way a prompted Vapi agent can
- Deployment beyond Google AI Studio requires managing persistent WebSocket connections, which makes it less plug-and-play than ElevenLabs or Vapi for web embedding
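The pricing figure above can be turned into a rough budget estimate. The sketch below assumes cost scales linearly with call length, which the source does not state; the function name `estimated_cost_cents` and the example workload are hypothetical, and actual billing may differ.

```python
# Rough cost estimate derived from the "roughly 14 cents per 10-minute call"
# figure quoted above.
# Assumption: cost scales linearly with call duration. The source does not
# confirm this, so treat the result as an order-of-magnitude estimate only.

CENTS_PER_10_MIN = 14.0  # approximate paid-tier figure from the key facts


def estimated_cost_cents(call_minutes: float, calls: int = 1) -> float:
    """Return the estimated cost in cents for `calls` calls of `call_minutes` each."""
    if call_minutes < 0 or calls < 0:
        raise ValueError("call_minutes and calls must be non-negative")
    per_minute = CENTS_PER_10_MIN / 10.0  # about 1.4 cents per minute
    return per_minute * call_minutes * calls


if __name__ == "__main__":
    # One 10-minute call reproduces the quoted figure.
    print(f"{estimated_cost_cents(10):.1f} cents")
    # Hypothetical workload: 500 calls of 4 minutes each, shown in dollars.
    print(f"${estimated_cost_cents(4, calls=500) / 100:.2f}")
```

Under the linear assumption this works out to about 1.4 cents per minute, or roughly 84 cents per hour of conversation.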
Source references
- [src-007] Nate Herk, Voice AI agents cluster (4 videos)
  - Videos referenced: Qt3zMBH-FNg