GPT Realtime Whisper

GPT Realtime Whisper is OpenAI's streaming speech-to-text model for low-latency transcription in realtime audio applications.

Key facts

  • Type: Streaming speech-to-text model
  • Maker: OpenAI
  • First seen in wiki: OpenAI's "Build Hour: GPT-Realtime-2" session [src-083]
  • Latency: OpenAI describes the model as tunable down to roughly 200 ms for realtime captions and voice-agent input [src-083].
  • Language coverage: The session describes support for about 80 input languages [src-083].
  • Role in stack: Sits between classic batch transcription and full speech-to-speech models: transcription-first, yet fast enough to drive captions, meeting notes, ambient context, and earlier tool calls [src-083].
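The transcription-only positioning above can be sketched as a session configuration: audio in, text out, no synthesized speech. This is a hedged illustration only; the field and event names below (`transcription_session.update`, `gpt-realtime-whisper`, `input_audio_format`) are assumptions modeled on typical realtime-API payloads, since the session [src-083] describes the model's role but not its wire format.

```python
# Hypothetical config for a transcription-only realtime session.
# All key names and the model id are assumptions, not a documented schema.
def transcription_session_config(language: str = "en") -> dict:
    """Build a session-update payload for streaming transcription."""
    return {
        "type": "transcription_session.update",  # assumed event name
        "session": {
            "model": "gpt-realtime-whisper",     # assumed model id
            "input_audio_format": "pcm16",       # assumed audio encoding
            # One of the ~80 input languages the session describes [src-083].
            "language": language,
        },
    }
```

In a real client, a payload like this would be sent once over the streaming connection before audio frames; the key contrast with batch transcription is that audio and text flow concurrently rather than as one upload and one response.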

What it does

GPT Realtime Whisper gives developers a streaming transcription layer for when they need text quickly but do not need a full voice-to-voice model. OpenAI positions it for realtime captions, meeting notes, and voice-agent systems, where earlier recognition lets the application prepare tool calls or gather context before the speaker has finished [src-083].
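The "act before the speaker has finished" idea reduces to handling incremental transcript events: partial text accumulates as deltas, and a completed utterance triggers downstream work early. The event names (`transcript.delta`, `transcript.completed`) and the `on_utterance` hook below are illustrative assumptions, not a documented interface; the sketch shows only the accumulation pattern.

```python
from typing import Callable

def make_transcript_handler(on_utterance: Callable[[str], None]):
    """Return a handler that accumulates streaming transcript deltas.

    on_utterance fires as soon as an utterance completes, so the app can
    start preparing a tool call or context lookup before the full turn ends.
    """
    parts: list[str] = []

    def handle(event: dict) -> str:
        kind = event.get("type")
        if kind == "transcript.delta":        # assumed event name
            parts.append(event["text"])
        elif kind == "transcript.completed":  # assumed event name
            utterance = "".join(parts)
            parts.clear()
            on_utterance(utterance)           # early hook for tool calls etc.
            return utterance
        # For deltas (and unknown events), return the running partial text,
        # which is what a live-caption view would render.
        return "".join(parts)

    return handle
```

A captions UI would render the partial text returned on each delta, while `on_utterance` drives the earlier tool calls the session describes.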

Source references

  • [src-083] OpenAI – "Build Hour: GPT-Realtime-2" (2026-05-13)