A technique in AI video editing where a transcription model (Whisper or ElevenLabs) produces millisecond-precision timestamps for each spoken word, which downstream tools use to trigger motion-graphic animations at the exact moment the word is spoken.
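As a minimal sketch of the data this technique depends on, the snippet below requests word-level timestamps from OpenAI's Whisper transcription endpoint via the official Node SDK and prints them. The file name `voiceover.mp3` is hypothetical, and the exact shape of the `words` array is assumed from the documented `verbose_json` response; an ElevenLabs backend would return an analogous structure under different field names.

```ts
import fs from "node:fs";
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

async function main() {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("voiceover.mp3"), // hypothetical input file
    model: "whisper-1",
    response_format: "verbose_json",
    // Request per-word start/end times (seconds, sub-second precision).
    timestamp_granularities: ["word"],
  });

  // The verbose_json response carries a `words` array; the cast covers
  // SDK versions whose return type only declares `text`.
  const words = (transcription as unknown as {
    words?: Array<{word: string; start: number; end: number}>;
  }).words ?? [];

  for (const w of words) {
    console.log(`${w.start.toFixed(2)}s-${w.end.toFixed(2)}s  ${w.word}`);
  }
}

main();
```

This `words` array, serialized to JSON, is the artifact the rest of the pipeline consumes.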
Key points
- Both the Whisper and ElevenLabs APIs produce word-level timestamps with millisecond precision [012]
- HyperFrames and Remotion consume this timestamp JSON to fire each animated element on the correct word (see the sketch after this list) [012]
- Creates natural-feeling, non-mechanical motion graphics: animations land exactly when the word is spoken, not on a fixed timer [012]
- Critical dependency: without word-level timestamps, animations fire too early or too late, or must be hardcoded per video, which is manual and does not scale [012]
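On the consuming side, the source does not detail HyperFrames' internals, so the sketch below shows the generic Remotion pattern instead. Each word's start time in seconds is multiplied by the composition's fps to get a frame number, and a `<Sequence from={...}>` mounts the animated element at exactly that frame. `AnimatedWord` and `WordSyncedCaptions` are illustrative names; `Sequence`, `spring`, `useCurrentFrame`, and `useVideoConfig` are Remotion's actual API.

```tsx
import React from "react";
import {
  AbsoluteFill,
  Sequence,
  spring,
  useCurrentFrame,
  useVideoConfig,
} from "remotion";

// Shape of the transcription output consumed here (times in seconds).
type Word = {word: string; start: number; end: number};

const AnimatedWord: React.FC<{text: string}> = ({text}) => {
  // Inside a <Sequence>, useCurrentFrame() is local to the sequence,
  // so frame 0 is the word's onset.
  const frame = useCurrentFrame();
  const {fps} = useVideoConfig();
  // Spring from 0 to 1 starting at the word's onset frame.
  const scale = spring({frame, fps, config: {damping: 200}});
  return (
    <span
      style={{
        display: "inline-block",
        transform: `scale(${scale})`,
        marginRight: 12,
        fontSize: 64,
      }}
    >
      {text}
    </span>
  );
};

// Mount each word's animation at the exact frame its audio begins.
export const WordSyncedCaptions: React.FC<{words: Word[]}> = ({words}) => {
  const {fps, durationInFrames} = useVideoConfig();
  return (
    <AbsoluteFill
      style={{justifyContent: "center", alignItems: "center", flexDirection: "row"}}
    >
      {words.map((w, i) => {
        const from = Math.round(w.start * fps); // seconds -> frames
        return (
          <Sequence
            key={i}
            from={from}
            durationInFrames={durationInFrames - from}
            layout="none"
          >
            <AnimatedWord text={w.word} />
          </Sequence>
        );
      })}
    </AbsoluteFill>
  );
};
```

Because every `from` value comes from the transcript rather than being hand-placed on a timeline, the same composition works for any voiceover, which is the scalability point raised in the last key point above.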
Related entities
- Whisper — primary transcription source
- ElevenLabs — alternative transcription backend
- HyperFrames — primary consumer of timestamp data
- Video Use — generates and passes the timestamp JSON
Related concepts
- AI Video Editing Pipeline — the pipeline that depends on this technique
- AI Avatar Content Pipeline — also uses word-level sync for avatar lip-sync timing
Source references
- [012] Nate Herk — Video editing & content creation cluster (2026-04-15 to 2026-04-23)