Word-Level Timestamp Sync

A technique in AI video editing: transcription (Whisper or ElevenLabs) produces word-level timestamps with millisecond precision, which are then used to trigger motion-graphic animations at the exact moment each word is spoken.
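As an illustration, a hedged sketch of what such a timestamp payload can look like. Whisper's word-level output (e.g. `word_timestamps=True` in the open-source Python library) reports per-word start/end times as float seconds; the field names below follow that shape, and the millisecond normalizer is a hypothetical helper, not part of either API.

```typescript
// Whisper-style word timing: start/end as float seconds.
interface WordTimestamp {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Hypothetical normalizer: convert float seconds to integer milliseconds
// so downstream consumers can compare trigger times without float drift.
function toMilliseconds(
  words: WordTimestamp[]
): { word: string; startMs: number; endMs: number }[] {
  return words.map((w) => ({
    word: w.word,
    startMs: Math.round(w.start * 1000),
    endMs: Math.round(w.end * 1000),
  }));
}

const sample: WordTimestamp[] = [
  { word: "Welcome", start: 0.32, end: 0.71 },
  { word: "back", start: 0.71, end: 0.98 },
];
```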

Key points

  • Whisper and the ElevenLabs API both produce word-level timestamps with millisecond precision [012]
  • HyperFrames and Remotion consume this timestamp JSON to fire each animated element at the correct word [012]
  • Creates natural-feeling, non-mechanical motion graphics — animations arrive exactly when the word is spoken, not on a fixed timer [012]
  • Critical dependency: without word-level timestamps, animations either fire too early/late or must be hardcoded per video (manual, not scalable) [012]
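The consuming side can be sketched as follows, assuming Remotion-style frame-based rendering (Remotion exposes the current frame to a component via `useCurrentFrame()`; the pure helpers below are hypothetical illustrations, not Remotion API):

```typescript
interface TimedWord {
  word: string;
  start: number; // seconds, from the transcription JSON
  end: number;   // seconds
}

// Map a word's spoken start time to the video frame where its
// animation should fire (Remotion renders at a fixed fps).
function startFrame(word: TimedWord, fps: number): number {
  return Math.round(word.start * fps);
}

// Words whose animation should be on screen at a given frame.
function activeWords(
  words: TimedWord[],
  frame: number,
  fps: number
): TimedWord[] {
  return words.filter(
    (w) =>
      frame >= Math.round(w.start * fps) && frame < Math.round(w.end * fps)
  );
}

const words: TimedWord[] = [
  { word: "fast", start: 1.0, end: 1.4 },
  { word: "editing", start: 1.4, end: 2.0 },
];
```

Inside a Remotion component, one would call `useCurrentFrame()` each render and display only `activeWords(words, frame, fps)`, so every element appears exactly when its word is spoken rather than on a fixed timer.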

Related entities

  • Whisper — primary transcription source
  • ElevenLabs — alternative transcription backend
  • HyperFrames — primary consumer of timestamp data
  • Video Use — generates and passes the timestamp JSON

Related concepts

Source references

  • [012] Nate Herk — Video editing & content creation cluster (2026-04-15 to 2026-04-23)