KV Cache

The stored key and value representations of prior tokens, used during autoregressive transformer decoding so that each new token can attend to the previous context without recomputing every prior layer from scratch.
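A minimal single-head sketch of the mechanism, using NumPy. All dimensions and weight matrices here are illustrative assumptions, not details from the source; the point is that each decode step appends one new key/value row to the cache and attends over everything cached so far, rather than re-running attention over the whole prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative, not from the source)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the KV cache: one K row and one V row per past token

def decode_step(x):
    """Process one new token vector x, reusing cached keys/values for the prefix."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # store this token's key...
    v_cache.append(x @ Wv)  # ...and value, so future steps never recompute them
    K = np.stack(k_cache)   # shape (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax
    return weights @ V                   # context vector for the new token

for _ in range(4):                       # four decode steps grow the cache to 4 entries
    out = decode_step(rng.standard_normal(d))
```

Note that each step's matrix work on the new token is small; the growing cost is reading `K` and `V` back from memory, which is why decode is bandwidth-bound.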

Key points

  • During decode, the new token runs through the model weights and attends to internal representations of previous tokens; those stored representations are the KV cache [src-042].
  • KV-cache access is mostly a memory-bandwidth problem rather than a matrix-multiply problem [src-042].
  • Unlike model weights, KV cache is unique to each sequence, so it cannot be amortized across a larger batch in the same way [src-042].
  • Long context raises serving cost because each decode step reads the full KV cache, whose size grows linearly with context length [src-042].
  • Keeping KV cache in memory saves compute, while deleting and rematerializing it saves storage at the cost of another forward pass [src-042].
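The bandwidth and per-sequence-cost points above come down to simple arithmetic: each decode step must read K and V for every layer and every cached token. A back-of-envelope sketch, where all model dimensions are illustrative assumptions rather than figures from the source:

```python
# Hypothetical model dimensions for illustration only.
n_layers    = 32
n_kv_heads  = 8     # e.g. grouped-query attention
head_dim    = 128
dtype_bytes = 2     # fp16/bf16

def kv_bytes(seq_len):
    # 2 tensors (K and V) per layer, each of shape (seq_len, n_kv_heads, head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for ctx in (4_096, 128_000):
    gb = kv_bytes(ctx) / 1e9
    print(f"{ctx:>7} tokens -> ~{gb:.2f} GB of KV cache read per decode step")
```

Because this memory is per sequence, a batch of N sequences multiplies it by N, unlike the weights, which are read once and shared across the batch.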

Related concepts

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)