KV Cache
Stored key and value projections of prior tokens, retained during autoregressive transformer decoding so each new token can attend to the previous context without re-projecting every prior token through every layer from scratch.
Key points
- During decode, the new token runs through the model weights and attends to internal representations of previous tokens; those stored representations are the KV cache (see the decode sketch after this list) [src-042].
- KV-cache access is mostly a memory-bandwidth problem rather than a matrix-multiply problem [src-042].
- Unlike model weights, KV cache is unique to each sequence, so it cannot be amortized across a larger batch in the same way [src-042].
- Long context raises serving cost because the cache grows linearly with sequence length, so every decode step may need to fetch correspondingly more KV-cache bytes from memory (sizing arithmetic below) [src-042].
- Keeping KV cache in memory saves compute, while deleting and rematerializing it saves storage at the cost of another forward pass (a rough cost comparison follows below) [src-042].
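A minimal sketch of the decode step described above, in NumPy: only the newest token is projected into queries, keys, and values, while the keys and values of prior tokens are read back from the cache. The function name, shapes, and single-head setup are illustrative assumptions, not details from the source.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One single-head attention decode step with a KV cache (illustrative).

    x_new:   (d_model,) hidden state of the newest token.
    k_cache: (t, d_head) cached keys of the t prior tokens.
    v_cache: (t, d_head) cached values of the t prior tokens.
    """
    # Project only the new token; prior tokens are never re-projected.
    q = x_new @ W_q
    k_cache = np.vstack([k_cache, x_new @ W_k])
    v_cache = np.vstack([v_cache, x_new @ W_v])
    # Attend over every cached position (causal by construction).
    scores = k_cache @ q / np.sqrt(W_q.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

# Toy usage: decode 4 tokens, growing the cache by one row per step.
rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
k_cache = v_cache = np.zeros((0, d_head))
for _ in range(4):
    out, k_cache, v_cache = decode_step(rng.normal(size=d_model),
                                        W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape)  # (4, 8): one cached key row per decoded token
```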
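A back-of-the-envelope sketch of the per-sequence sizing and bandwidth points: the cache holds one key tensor and one value tensor per layer, so its size scales linearly with sequence length, and each decode step that streams the whole cache is bounded by memory bandwidth. The Llama-2-70B-like configuration and the H100-class bandwidth figure are assumptions for illustration, not numbers from the source.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV-cache size: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical Llama-2-70B-like config: 80 layers, 8 KV heads (GQA),
# head_dim 128, fp16 cache, 32k-token context (all assumed figures).
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                         seq_len=32_768)
print(f"{per_seq / 2**30:.1f} GiB per sequence")  # ~10 GiB

# If a decode step streams the whole cache once, dividing by HBM
# bandwidth gives a lower bound on the step time for that sequence.
hbm_bw = 3.35e12  # bytes/s, roughly H100 SXM HBM (assumption)
print(f"lower bound per step: {per_seq / hbm_bw * 1e3:.2f} ms")
```

Because this cache is unique to each sequence, batching more requests multiplies the cache bytes rather than amortizing them, unlike the shared model weights.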
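And a rough sketch of the keep-vs-rematerialize tradeoff from the last bullet, using the standard ~2 FLOPs per parameter per token estimate for a prefill pass; the model size, context length, and GPU throughput below are assumptions for illustration.

```python
def rematerialize_flops(n_params, seq_len):
    """Rough FLOPs to rebuild a KV cache via a fresh prefill pass,
    using the standard ~2 FLOPs/parameter/token estimate."""
    return 2 * n_params * seq_len

# Hypothetical 70B-parameter model, 32k-token context (assumed figures).
flops = rematerialize_flops(n_params=70e9, seq_len=32_768)
gpu_flops = 1e15  # ~1 PFLOP/s of usable fp16 compute (assumption)
print(f"rebuild cost: {flops / gpu_flops:.1f} s of GPU compute "
      f"vs ~10 GiB of memory to keep the cache resident")
```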
Related concepts
- Prefill vs Decode
- KV Cache Tiering
- Memory Wall for Long Context
- Prompt Caching for Agents
- Claude Code Context Management Discipline
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)