OpenAI Whisper
Speech-to-text model family used for video transcription and realtime speech recognition. In the wiki it first appears as the transcription backend for AI video editing, and later as GPT Realtime Whisper for low-latency streaming transcription.
Key facts
- Type: Speech-to-text model
- Maker: OpenAI (open-source)
- Status: Active
- Variants: OpenAI API (hosted), whisper.cpp (local — free, but RAM-intensive)
- Output: Transcript text + word-level timestamps in milliseconds
- Realtime variant: GPT Realtime Whisper, described by OpenAI as a streaming transcription model with tunable latency down to roughly 200 ms and support for about 80 input languages [src-083]
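The hosted API can return the word-level timestamps listed above. A minimal sketch, assuming the official `openai` Python client: the API reports word times as seconds (floats), so the `words_to_ms` helper below (illustrative, not part of any wiki pipeline) converts them to the millisecond units described in Key facts.

```python
# Sketch: request word-level timestamps from the hosted Whisper API and
# convert them to milliseconds. Assumes the official `openai` client;
# `words_to_ms` is an illustrative helper, not a library function.

def words_to_ms(words):
    """Convert Whisper word entries (times in seconds) to integer ms."""
    return [
        {"word": w["word"],
         "start_ms": round(w["start"] * 1000),
         "end_ms": round(w["end"] * 1000)}
        for w in words
    ]

def transcribe(path):
    # Hosted-API call (needs OPENAI_API_KEY); not executed in this sketch.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as f:
        resp = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],  # word-level, not just segments
        )
    return words_to_ms([w.model_dump() for w in resp.words])

# The response's `words` field looks roughly like this:
sample = [{"word": "Hello", "start": 0.0, "end": 0.42},
          {"word": "world", "start": 0.42, "end": 0.9}]
print(words_to_ms(sample))
```

The same helper applies to whisper.cpp output, which also reports times rather than milliseconds by default.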
Use in pipeline
video-use and HyperFrames both support Whisper as a transcription backend. The word-level timestamps it produces are passed to HyperFrames to trigger animation elements at the exact moment each word is spoken. [src-012]
Related
- See also: OpenAI, video-use, HyperFrames, ElevenLabs
- Concepts: AI Video Editing Pipeline, Word-Level Timestamp Sync, Live Voice Models, Voice Agents