OpenAI Whisper
Speech-to-text model family used for video transcription and realtime speech recognition. In the wiki it first appears as the transcription backend for AI video editing, and later as GPT Realtime Whisper for low-latency streaming transcription.
Key facts
- Type: Speech-to-text model
- Maker: OpenAI (open-source)
- Status: Active
- Variants: OpenAI API (hosted), whisper.cpp (local — free, but RAM-intensive)
- Output: Transcript text + word-level timestamps in milliseconds
- Realtime variant: GPT Realtime Whisper, described by OpenAI as a streaming transcription model with tunable latency down to roughly 200 ms and support for about 80 input languages [src-083]
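The hosted API can return the word-level timestamps listed above. A minimal sketch, assuming the official `openai` Python client: the API reports word times as seconds (floats), so the `words_to_ms` helper below (illustrative, not part of any wiki pipeline) converts them to the millisecond units described in Key facts.

```python
# Sketch: request word-level timestamps from the hosted Whisper API and
# convert them to milliseconds. Assumes the official `openai` client;
# `words_to_ms` is an illustrative helper, not a library function.

def words_to_ms(words):
    """Convert Whisper word entries (times in seconds) to integer ms."""
    return [
        {"word": w["word"],
         "start_ms": round(w["start"] * 1000),
         "end_ms": round(w["end"] * 1000)}
        for w in words
    ]

def transcribe(path):
    # Hosted-API call (needs OPENAI_API_KEY); not executed in this sketch.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as f:
        resp = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],  # word-level, not just segments
        )
    return words_to_ms([w.model_dump() for w in resp.words])

# The response's `words` field looks roughly like this:
sample = [{"word": "Hello", "start": 0.0, "end": 0.42},
          {"word": "world", "start": 0.42, "end": 0.9}]
print(words_to_ms(sample))
```

The same helper applies to whisper.cpp output, which also reports times rather than milliseconds by default.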
Use in pipeline
video-use and HyperFrames both support Whisper as a transcription backend. The word-level timestamps it produces are passed to HyperFrames to trigger animation elements at the exact moment each word is spoken. [src-012]
Related
- See also: OpenAI, video-use, HyperFrames, ElevenLabs
- Concepts: AI Video Editing Pipeline, Word-Level Timestamp Sync, Live Voice Models, Voice Agents