KV Cache

The stored key and value representations of prior tokens, used during autoregressive transformer decoding so that each new token can attend to the previous context without recomputing every prior layer from scratch.
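A minimal single-head sketch of the mechanism, using NumPy. All dimensions and weight matrices here are illustrative assumptions, not details from the source; the point is that each decode step appends one new key/value row to the cache and attends over everything cached so far, rather than re-running attention over the whole prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative, not from the source)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the KV cache: one K row and one V row per past token

def decode_step(x):
    """Process one new token vector x, reusing cached keys/values for the prefix."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # store this token's key...
    v_cache.append(x @ Wv)  # ...and value, so future steps never recompute them
    K = np.stack(k_cache)   # shape (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax
    return weights @ V                   # context vector for the new token

for _ in range(4):                       # four decode steps grow the cache to 4 entries
    out = decode_step(rng.standard_normal(d))
```

Note that each step's matrix work on the new token is small; the growing cost is reading `K` and `V` back from memory, which is why decode is bandwidth-bound.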

Key points

  • During decode, the new token runs through the model weights and attends to internal representations of previous tokens; those stored representations are the KV cache [src-042].
  • KV-cache access is mostly a memory-bandwidth problem rather than a matrix-multiply problem [src-042].
  • Unlike model weights, KV cache is unique to each sequence, so it cannot be amortized across a larger batch in the same way [src-042].
  • Long context raises serving cost because each decode step reads the full KV cache, whose size grows linearly with context length [src-042].
  • Keeping KV cache in memory saves compute, while deleting and rematerializing it saves storage at the cost of another forward pass [src-042].
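The bandwidth and per-sequence-cost points above come down to simple arithmetic: each decode step must read K and V for every layer and every cached token. A back-of-envelope sketch, where all model dimensions are illustrative assumptions rather than figures from the source:

```python
# Hypothetical model dimensions for illustration only.
n_layers    = 32
n_kv_heads  = 8     # e.g. grouped-query attention
head_dim    = 128
dtype_bytes = 2     # fp16/bf16

def kv_bytes(seq_len):
    # 2 tensors (K and V) per layer, each of shape (seq_len, n_kv_heads, head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for ctx in (4_096, 128_000):
    gb = kv_bytes(ctx) / 1e9
    print(f"{ctx:>7} tokens -> ~{gb:.2f} GB of KV cache read per decode step")
```

Because this memory is per sequence, a batch of N sequences multiplies it by N, unlike the weights, which are read once and shared across the batch.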

Related concepts

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)