GPT Realtime Whisper
GPT Realtime Whisper is OpenAI's streaming speech-to-text model for low-latency transcription in realtime audio applications.
Key facts
- Type: Streaming speech-to-text model
- Maker: OpenAI
- First seen in wiki: OpenAI's Build Hour on GPT Realtime 2 [src-083]
- Latency: OpenAI describes the model as tunable down to roughly 200ms latency for realtime captions and voice-agent input [src-083].
- Language coverage: The session describes support for about 80 input languages [src-083].
- Role in stack: It sits between classic batch transcription and full speech-to-speech models; still transcription-first, but fast enough to drive captions, meeting notes, ambient context, and earlier tool calls [src-083].
What it does
GPT Realtime Whisper gives developers a streaming transcription layer when they need text quickly but do not necessarily need a full voice-to-voice model. OpenAI positions it for realtime captions, meeting notes, and voice-agent systems where early recognition lets the application prepare tool calls or context before the speaker has finished [src-083].
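The early-recognition pattern described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual API: the event shape (`TranscriptEvent`), the field names, and the `weather_lookup` tool name are all hypothetical stand-ins for whatever the real streaming interface emits. The point it shows is that an application can act on partial transcripts instead of waiting for the final one.

```python
# Hypothetical sketch of consuming a partial-transcript stream and
# preparing a tool call before the utterance is complete.
# Event shape and names are illustrative assumptions, not the real API.

from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str        # transcript accumulated so far
    is_final: bool   # True once the utterance is complete

def stream_events():
    """Simulated partial-transcript stream (stand-in for model output)."""
    yield TranscriptEvent("what's the", False)
    yield TranscriptEvent("what's the weather in", False)
    yield TranscriptEvent("what's the weather in Paris", False)
    yield TranscriptEvent("what's the weather in Paris today", True)

def handle_stream(events):
    prepared_tool = None
    for ev in events:
        # Early recognition: as soon as a partial transcript suggests an
        # intent, pre-select the tool instead of waiting for final text.
        if prepared_tool is None and "weather" in ev.text:
            prepared_tool = "weather_lookup"
        if ev.is_final:
            return ev.text, prepared_tool
    return None, prepared_tool

final_text, tool = handle_stream(stream_events())
```

Here the tool is chosen on the second partial event, two events before the final transcript arrives; in a real voice agent that head start is what the session frames as "earlier tool calls" [src-083].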
Related
- See also: OpenAI, OpenAI Whisper, GPT Realtime 2, GPT Realtime Translate
- Concepts: Live Voice Models, Voice Agents
Source references
- [src-083] OpenAI – "Build Hour: GPT-Realtime-2" (2026-05-13)