Prefill vs Decode

The operational split in LLM serving between prefill, which processes a block of input tokens in parallel, and decode, which generates output tokens one at a time.
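
Below is a minimal control-flow sketch of that split, using a toy stand-in for the model and the KV cache (no real serving stack or framework API is implied): prefill makes one pass over the whole prompt and fills the cache, then decode extends the cache one token per step.

    # Hypothetical sketch of the prefill/decode split; the "layer" and the cache
    # are toy stand-ins, not any particular framework's interface.
    from typing import List, Tuple

    def toy_layer(token: int, kv_cache: List[int]) -> int:
        # Stand-in for a transformer forward step: "attends" over everything cached
        # so far. A real step would read the model weights plus all cached keys/values.
        return (token + sum(kv_cache)) % 50257  # vocab-sized output, illustrative only

    def prefill(prompt_tokens: List[int]) -> Tuple[List[int], int]:
        kv_cache: List[int] = []
        last = 0
        for t in prompt_tokens:                 # conceptually parallel on the accelerator:
            last = toy_layer(t, kv_cache)       # all prompt positions share one weight fetch
            kv_cache.append(t)
        return kv_cache, last

    def decode(kv_cache: List[int], first: int, max_new: int) -> List[int]:
        out, token = [], first
        for _ in range(max_new):                # strictly sequential: one token per step,
            token = toy_layer(token, kv_cache)  # each step re-reads weights and KV cache
            kv_cache.append(token)
            out.append(token)
        return out

    kv, nxt = prefill([101, 2023, 2003, 1037])  # whole prompt processed in one pass
    print(decode(kv, nxt, max_new=5))           # output generated one token at a time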

Key points

  • Decode generates one next token per sequence per step and is heavily exposed to memory bandwidth because each step re-reads the model weights and all prior KV-cache data [src-042].
  • Prefill processes many input tokens in a single pass, so those memory costs are amortized over more tokens and the operation is more likely to be compute-limited [src-042]; the sketch after this list makes the arithmetic concrete.
  • The common API pattern of pricing output tokens several times higher than input tokens is consistent with decode being more memory-bandwidth constrained than prefill [src-042].
  • Tool calls, user messages, and file reads create new prefill segments inside a chat or agent session [src-042].
  • This split also helps explain why prompt caching and cache hits can reduce price and latency: a hit avoids recomputing already-processed context [src-042].
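
The bandwidth argument in the first two points can be made concrete with a back-of-envelope sketch. The figures below are illustrative assumptions (a hypothetical 70B-parameter model served in fp16), not numbers from the source: dividing forward-pass FLOPs by the bytes of weights that must be read per pass puts a single decode step far below the compute-limited regime, while a multi-thousand-token prefill sits well inside it.

    # Illustrative arithmetic-intensity estimate; all constants are assumptions.
    params = 70e9            # hypothetical 70B-parameter model
    bytes_per_param = 2      # fp16 weights
    weight_bytes = params * bytes_per_param

    def forward_flops(tokens: int) -> float:
        # ~2 FLOPs per parameter per token is a standard forward-pass approximation
        return 2 * params * tokens

    for tokens in (1, 2048):  # 1 token = a decode step; 2048 = a prefill chunk
        intensity = forward_flops(tokens) / weight_bytes  # FLOPs per byte of weights read
        print(f"{tokens:>5} tokens/pass -> ~{intensity:.0f} FLOP per byte of weights")

    # Modern accelerators need on the order of hundreds of FLOPs per byte moved to stay
    # compute-limited, so ~1 FLOP/byte at decode means the step is dominated by reading
    # weights (plus the growing KV cache), whereas the 2048-token prefill amortizes that
    # traffic across many tokens and lands in the compute-limited regime.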

Related concepts

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)