Prefill vs Decode
The operational split in LLM serving between processing a prompt's input tokens in parallel (prefill) and generating output tokens sequentially, one at a time (decode).
Key points
- Decode generates one next token per sequence per step and is heavily exposed to memory bandwidth, because each step re-reads the model weights and the sequence's growing KV cache [src-042].
- Prefill processes many input tokens in a single pass, so the cost of streaming weights is amortized across those tokens and the pass is more likely to be compute-limited; see the roofline sketch after this list [src-042].
- The common API pattern where output tokens cost several times more than input tokens is consistent with decode being more memory-bandwidth constrained than prefill [src-042].
- Tool calls, user messages, and file reads create new prefill segments inside a chat or agent session [src-042].
- Understanding this split helps explain why prompt caching and cache hits can reduce price and latency: they avoid redoing prefill for context that has already been processed (see the cost sketch below) [src-042].
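A minimal roofline-style sketch of the first two points. The model size, 2 FLOPs-per-parameter-per-token estimate, and hardware numbers (peak FLOP/s, HBM bandwidth) are illustrative assumptions, not figures from the source, and the model ignores KV-cache reads, attention FLOPs, and batching across sequences.

```python
# Back-of-envelope comparison of compute time vs. weight-read time for one
# forward pass. All constants below are assumed for illustration only.

PARAMS = 70e9          # hypothetical dense model size (assumed)
BYTES_PER_PARAM = 2    # 16-bit weights (assumed)
PEAK_FLOPS = 1e15      # accelerator peak FLOP/s (assumed)
HBM_BW = 3e12          # accelerator memory bandwidth in bytes/s (assumed)

def step_time(batch_tokens: int) -> tuple[float, float]:
    """Estimate compute time and weight-streaming time for one forward pass
    over `batch_tokens` tokens (KV-cache traffic and attention FLOPs ignored)."""
    flops = 2 * PARAMS * batch_tokens               # ~2 FLOPs per param per token
    compute_s = flops / PEAK_FLOPS
    memory_s = (PARAMS * BYTES_PER_PARAM) / HBM_BW  # weights streamed once per pass
    return compute_s, memory_s

for label, tokens in [("decode step (1 token, batch 1)", 1),
                      ("prefill pass (2048-token prompt)", 2048)]:
    c, m = step_time(tokens)
    bound = "memory-bound" if m > c else "compute-bound"
    print(f"{label}: compute {c*1e3:.2f} ms, weight reads {m*1e3:.2f} ms -> {bound}")
```

With these assumed numbers, the single-token decode step spends far longer reading weights than computing, while the 2048-token prefill pass flips the ratio, which is the intuition behind the memory-bound vs. compute-bound distinction.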
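A toy cost sketch for the prompt-caching point. The per-token prices, cache-hit discount, and session shape are made-up placeholders, not any provider's actual pricing; the only point is that a large, stable prefix re-sent every turn dominates cost unless cache hits make it cheap.

```python
# Toy comparison of an agent session's cost with and without prompt-cache hits
# on a stable context prefix. All prices and token counts are assumed.

INPUT_PRICE = 3.0 / 1e6        # $ per fresh input token (assumed)
OUTPUT_PRICE = 15.0 / 1e6      # $ per output token (assumed)
CACHE_HIT_PRICE = 0.3 / 1e6    # $ per cached input token on a hit (assumed)

context_tokens = 50_000    # stable prefix: system prompt, files, prior turns (assumed)
new_tokens_per_turn = 500  # fresh user/tool input each turn (assumed)
output_per_turn = 800      # generated tokens per turn (assumed)
turns = 20

no_cache = turns * ((context_tokens + new_tokens_per_turn) * INPUT_PRICE
                    + output_per_turn * OUTPUT_PRICE)
with_cache = turns * (context_tokens * CACHE_HIT_PRICE
                      + new_tokens_per_turn * INPUT_PRICE
                      + output_per_turn * OUTPUT_PRICE)

print(f"without caching:  ${no_cache:.2f}")
print(f"with cache hits:  ${with_cache:.2f}")
```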
Related concepts
- KV Cache
- Prompt Caching for Agents
- KV Cache Tiering
- Claude Code Token Economics
- LLM Inference Economics
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)