World Action Models

World action models are robotics policy models that jointly predict near-future world states and robot actions, using video/world-model pretraining as the physical analogue of next-token prediction.

Key points

  • Fan argues that vision-language-action models are too language-heavy: they encode knowledge and nouns well, but are weaker at physics and verbs [src-082].
  • Video world models suggest a different pretraining target: simulate the next physical world state in pixels, learning gravity, lighting, refraction, buoyancy, and visual planning from scale [src-082].
  • Video world models can also produce "physics slop": simulations that are visually plausible but physically wrong, so robotics needs action alignment on top of world simulation [src-082].
  • Dream Zero is Fan's example of a world action model: it "dreams" a few seconds ahead, jointly decodes future video and high-dimensional motor actions, then acts accordingly [src-082].
  • In this framing, vision and action become first-class citizens rather than language being the dominant parameter budget [src-082].
  • Fan positions world action models as the robotics analogue of the LLM stack: pretraining learns a broad simulator, action fine-tuning selects useful robot futures, and reinforcement learning carries the last mile [src-082].
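The joint decode-then-act loop described above can be sketched as a minimal interface. Everything here is a hypothetical illustration of the pattern, not Dream Zero's actual API: the class name, horizon, action dimension, and the placeholder "prediction" are all assumptions; a real model would run a learned video/action decoder in `rollout`.

```python
import random


class WorldActionModel:
    """Hypothetical sketch of a world action model policy: jointly decode a
    short video rollout and a motor-action sequence, then act on it."""

    def __init__(self, horizon_frames=8, action_dim=24, frame_shape=(64, 64)):
        self.horizon_frames = horizon_frames  # how far ahead the model "dreams"
        self.action_dim = action_dim          # high-dimensional motor command
        self.frame_shape = frame_shape        # predicted frame resolution

    def rollout(self, observations):
        """Jointly predict future frames and the actions paired with them.
        Placeholder outputs stand in for a learned decoder's predictions."""
        frames = [
            [[0.0] * self.frame_shape[1] for _ in range(self.frame_shape[0])]
            for _ in range(self.horizon_frames)
        ]
        actions = [
            [random.uniform(-1.0, 1.0) for _ in range(self.action_dim)]
            for _ in range(self.horizon_frames)
        ]
        return frames, actions

    def act(self, observations):
        """Dream ahead, then execute only the first predicted action
        (a receding-horizon control loop)."""
        _frames, actions = self.rollout(observations)
        return actions[0]
```

The receding-horizon `act` step mirrors the "dream, then act" loop in the bullet above: the model commits only to the first action, then re-dreams from the next observation.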

Source references

  • [src-082] Sequoia Capital — "Robotics' End Game: Nvidia's Jim Fan" (2026-04-30)