World Action Models
World action models are robotics policy models that jointly predict near-future world states and robot actions, using video/world-model pretraining as the physical analogue of next-token prediction.
Key points
- Fan argues that vision-language-action models are too language-heavy: they encode knowledge and nouns well, but are weaker at physics and verbs [src-082].
- Video world models suggest a different pretraining target: simulate the next physical world state in pixels, learning gravity, lighting, refraction, buoyancy, and visual planning at scale [src-082].
- The catch is that video models can also produce "physics slop": simulations that are visually plausible but physically wrong, so robotics needs action alignment on top of world simulation [src-082].
- Dream Zero is Fan's example of a world/action policy: it "dreams" a few seconds ahead, jointly decodes future video and high-dimensional motor actions, then acts accordingly [src-082].
- In this framing, vision and action become first-class citizens rather than language being the dominant parameter budget [src-082].
- Fan positions world action models as the robotics analogue of the LLM stack: pretraining learns a broad simulator, action fine-tuning selects useful robot futures, and reinforcement learning carries the last mile [src-082].
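The joint decoding described in the key points, one forward pass that yields both a short "dream" of future frames and a matching chunk of motor actions, can be sketched as a toy interface. Every name, shape, and the trivial rollout below are illustrative assumptions for exposition, not the actual Dream Zero architecture or API:

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]   # tiny stand-in for an image
Action = List[float]        # e.g. joint-velocity targets

@dataclass
class Dream:
    frames: List[Frame]     # predicted near-future world states
    actions: List[Action]   # motor commands aligned to those frames

class ToyWorldActionModel:
    """Illustrative sketch: a policy whose single output couples
    predicted video and high-dimensional actions, rather than
    emitting actions alone."""

    def __init__(self, horizon: int = 4, action_dim: int = 7):
        self.horizon = horizon        # how many steps to "dream" ahead
        self.action_dim = action_dim  # motor dimensionality (assumed)

    def dream(self, observation: Frame) -> Dream:
        # A real model would decode learned dynamics; here we just
        # repeat the observation and emit zero actions to show the
        # joint (frames, actions) output structure.
        frames = [observation for _ in range(self.horizon)]
        actions = [[0.0] * self.action_dim for _ in range(self.horizon)]
        return Dream(frames=frames, actions=actions)

model = ToyWorldActionModel(horizon=3)
obs = [[0.0, 1.0], [1.0, 0.0]]
d = model.dream(obs)
print(len(d.frames), len(d.actions[0]))  # 3 7
```

The point of the coupled output type is the framing in the notes above: the world-state prediction and the action chunk come from one model, so action fine-tuning can select physically useful futures rather than bolting control onto a frozen video model.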
Related entities
Related concepts
- World Models
- Embodied Reasoning
- Physical AI
- Robotics Data Loop
- LLMs In Robotics
- Intuitive Physics In AI
Source references
- [src-082] Sequoia Capital — "Robotics' End Game: Nvidia's Jim Fan" (2026-04-30)