World Action Models
World action models are robotics policy models that jointly predict near-future world states and robot actions, using video/world-model pretraining as the physical analogue of next-token prediction.
Key points
- Fan argues that vision-language-action models are too language-heavy: they encode knowledge and nouns well, but are weaker at physics and verbs [src-082].
- Video world models suggest a different pretraining target: simulate the next physical world state in pixels, learning gravity, lighting, refraction, buoyancy, and visual planning at scale [src-082].
- The catch is that video models can also produce "physics slop": simulations that are visually plausible but physically wrong, so robotics needs action alignment on top of world simulation [src-082].
- Dream Zero is Fan's example of a world/action policy: it "dreams" a few seconds ahead, jointly decodes future video and high-dimensional motor actions, then acts accordingly [src-082].
- In this framing, vision and action become first-class citizens rather than language being the dominant parameter budget [src-082].
- Fan positions world action models as the robotics analogue of the LLM stack: pretraining learns a broad simulator, action fine-tuning selects useful robot futures, and reinforcement learning carries the last mile [src-082].
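The joint decoding described in the key points, one forward pass that yields both a short "dream" of future frames and a matching chunk of motor actions, can be sketched as a toy interface. Every name, shape, and the trivial rollout below are illustrative assumptions for exposition, not the actual Dream Zero architecture or API:

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]   # tiny stand-in for an image
Action = List[float]        # e.g. joint-velocity targets

@dataclass
class Dream:
    frames: List[Frame]     # predicted near-future world states
    actions: List[Action]   # motor commands aligned to those frames

class ToyWorldActionModel:
    """Illustrative sketch: a policy whose single output couples
    predicted video and high-dimensional actions, rather than
    emitting actions alone."""

    def __init__(self, horizon: int = 4, action_dim: int = 7):
        self.horizon = horizon        # how many steps to "dream" ahead
        self.action_dim = action_dim  # motor dimensionality (assumed)

    def dream(self, observation: Frame) -> Dream:
        # A real model would decode learned dynamics; here we just
        # repeat the observation and emit zero actions to show the
        # joint (frames, actions) output structure.
        frames = [observation for _ in range(self.horizon)]
        actions = [[0.0] * self.action_dim for _ in range(self.horizon)]
        return Dream(frames=frames, actions=actions)

model = ToyWorldActionModel(horizon=3)
obs = [[0.0, 1.0], [1.0, 0.0]]
d = model.dream(obs)
print(len(d.frames), len(d.actions[0]))  # 3 7
```

The point of the coupled output type is the framing in the notes above: the world-state prediction and the action chunk come from one model, so action fine-tuning can select physically useful futures rather than bolting control onto a frozen video model.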
Related entities
Related concepts
- World Models
- Embodied Reasoning
- Physical AI
- Robotics Data Loop
- LLMs In Robotics
- Intuitive Physics In AI
Source references
- [src-082] Sequoia Capital — "Robotics' End Game: Nvidia's Jim Fan" (2026-04-30)