Prefill vs Decode

The operational split in LLM serving between prefill, which processes a block of input tokens in parallel, and decode, which generates output tokens one at a time.
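
Below is a minimal control-flow sketch of that split, using a toy stand-in for the model and the KV cache (no real serving stack or framework API is implied): prefill makes one pass over the whole prompt and fills the cache, then decode extends the cache one token per step.

    # Hypothetical sketch of the prefill/decode split; the "layer" and the cache
    # are toy stand-ins, not any particular framework's interface.
    from typing import List, Tuple

    def toy_layer(token: int, kv_cache: List[int]) -> int:
        # Stand-in for a transformer forward step: "attends" over everything cached
        # so far. A real step would read the model weights plus all cached keys/values.
        return (token + sum(kv_cache)) % 50257  # vocab-sized output, illustrative only

    def prefill(prompt_tokens: List[int]) -> Tuple[List[int], int]:
        kv_cache: List[int] = []
        last = 0
        for t in prompt_tokens:                 # conceptually parallel on the accelerator:
            last = toy_layer(t, kv_cache)       # all prompt positions share one weight fetch
            kv_cache.append(t)
        return kv_cache, last

    def decode(kv_cache: List[int], first: int, max_new: int) -> List[int]:
        out, token = [], first
        for _ in range(max_new):                # strictly sequential: one token per step,
            token = toy_layer(token, kv_cache)  # each step re-reads weights and KV cache
            kv_cache.append(token)
            out.append(token)
        return out

    kv, nxt = prefill([101, 2023, 2003, 1037])  # whole prompt processed in one pass
    print(decode(kv, nxt, max_new=5))           # output generated one token at a time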

Key points

  • Decode generates one next token per sequence per step and is heavily exposed to memory bandwidth because each step re-reads the model weights and all prior KV-cache data [src-042].
  • Prefill processes many input tokens in a single pass, so those memory costs are amortized over more tokens and the operation is more likely to be compute-limited [src-042]; the sketch after this list makes the arithmetic concrete.
  • The common API pattern of pricing output tokens several times higher than input tokens is consistent with decode being more memory-bandwidth constrained than prefill [src-042].
  • Tool calls, user messages, and file reads create new prefill segments inside a chat or agent session [src-042].
  • This split also helps explain why prompt caching and cache hits can reduce price and latency: a hit avoids recomputing already-processed context [src-042].
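
The bandwidth argument in the first two points can be made concrete with a back-of-envelope sketch. The figures below are illustrative assumptions (a hypothetical 70B-parameter model served in fp16), not numbers from the source: dividing forward-pass FLOPs by the bytes of weights that must be read per pass puts a single decode step far below the compute-limited regime, while a multi-thousand-token prefill sits well inside it.

    # Illustrative arithmetic-intensity estimate; all constants are assumptions.
    params = 70e9            # hypothetical 70B-parameter model
    bytes_per_param = 2      # fp16 weights
    weight_bytes = params * bytes_per_param

    def forward_flops(tokens: int) -> float:
        # ~2 FLOPs per parameter per token is a standard forward-pass approximation
        return 2 * params * tokens

    for tokens in (1, 2048):  # 1 token = a decode step; 2048 = a prefill chunk
        intensity = forward_flops(tokens) / weight_bytes  # FLOPs per byte of weights read
        print(f"{tokens:>5} tokens/pass -> ~{intensity:.0f} FLOP per byte of weights")

    # Modern accelerators need on the order of hundreds of FLOPs per byte moved to stay
    # compute-limited, so ~1 FLOP/byte at decode means the step is dominated by reading
    # weights (plus the growing KV cache), whereas the 2048-token prefill amortizes that
    # traffic across many tokens and lands in the compute-limited regime.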

Related concepts

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)