Gemini 3.1 Flash Live

Gemini 3.1 Flash Live is Google’s streaming speech-to-speech voice model. It replaces the classic STT-to-LLM-to-TTS pipeline with a single native speech model. Google markets it as its biggest voice upgrade, citing lower latency, noise-robust listening, interruption handling, and multimodal vision input.

Key facts

Native speech-to-speech: there is no intermediate text transcription step, so prosody, sarcasm, and stress carry through to the reasoning layer
Beats Gemini 2.5 Flash by ~19% on multi-step function calling and outperforms competitor models on the Audio Multi-Challenge benchmark
Supports multimodal vision: the agent can watch a webcam or screen-share feed and reason about what it sees
Over 70 supported languages, enabling real-time translation use cases
Free tier available in Google AI Studio with no API key; paid tier removes the ‘training on your data’ clause and raises rate limits
Pricing: roughly 14 cents per 10-minute call on the paid tier
Current limitation: the model stops speaking during function calls, so it cannot narrate over tool execution the way a prompted Vapi agent can
Deployment beyond Google AI Studio requires managing persistent WebSocket connections, which makes it less plug-and-play than ElevenLabs or Vapi for web embedding
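
To make the pricing figure above easier to apply, here is a minimal budgeting sketch. It uses only the "roughly 14 cents per 10-minute call" figure from this note and assumes cost scales linearly with call duration, which the source does not confirm. The function name and parameters are hypothetical, chosen for illustration, and the result is a rough estimate, not a quote from Google's price list.

```python
# Approximate paid-tier figure taken from this note: ~14 cents per 10-minute call.
CENTS_PER_10_MIN = 14.0


def estimate_cost_usd(call_minutes: float, calls: int = 1) -> float:
    """Rough cost estimate in US dollars.

    Assumption (not confirmed by the source): cost scales linearly
    with call duration, i.e. about 1.4 cents per minute.
    """
    if call_minutes < 0 or calls < 0:
        raise ValueError("call_minutes and calls must be non-negative")
    cents_per_minute = CENTS_PER_10_MIN / 10.0
    total_cents = cents_per_minute * call_minutes * calls
    return round(total_cents / 100.0, 2)


if __name__ == "__main__":
    # One 10-minute call, then 1,000 calls of 5 minutes each.
    print(estimate_cost_usd(10))
    print(estimate_cost_usd(5, calls=1000))
```

Under the linear assumption, a workload of 1,000 five-minute calls comes to about 70 dollars. Actual billing may differ, so treat the output as an order-of-magnitude check only.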

Source references

^[src-007] Nate Herk: Voice AI agents cluster (4 videos)

– Videos referenced: Qt3zMBH-FNg

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

The useful next step is to connect this concept back to applied product leadership and operating models.
