KV Cache Tiering

A serving strategy for deciding whether to keep KV-cache state in fast memory, move it to slower storage, or discard it and rematerialize it later.

Key points

  • Pope distinguishes between rematerializing KV cache from token IDs and storing previously computed KV cache in memory [src-042].
  • Rematerialization saves storage but requires another forward pass through the model [src-042].
  • KV cache can be stored in different memory tiers such as HBM, host DDR, flash, or slower disks, each with different holding costs and retrieval costs [src-042].
  • The durations a provider offers in its cache pricing, such as short-lived versus hour-long cache writes, can hint at which memory tier the provider is using [src-042].
  • The right tier balances hold time against retrieval time: short-lived caches justify faster memory, while long-lived caches may move to cheaper slower tiers [src-042].
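
The trade-off in the points above can be sketched as a small cost model: each option has a holding cost that grows with how long the cache is kept and a retrieval cost paid when the cache is reused, and rematerialization is treated as an option with zero holding cost and a high retrieval cost (the extra forward pass). This is a minimal illustration, not a method described in [src-042]. The names Option, total_cost, and cheapest_option are hypothetical, and every number is made up, in arbitrary cost units, chosen only so that faster tiers cost more to hold and less to retrieve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    hold_cost_per_gb_hour: float  # cost of keeping 1 GB stored for 1 hour
    retrieval_cost_per_gb: float  # cost of making 1 GB usable again for decoding


# Hypothetical numbers in arbitrary units; only their ordering is meaningful.
OPTIONS = [
    Option("hbm", 100.0, 0.0),
    Option("host_ddr", 10.0, 1.0),
    Option("flash", 1.0, 5.0),
    Option("disk", 0.1, 20.0),
    # Store nothing and recompute from token IDs with another forward pass.
    # Assumption: recompute cost is modeled as proportional to cache size.
    Option("rematerialize", 0.0, 50.0),
]


def total_cost(option, size_gb, hold_hours, expected_reuses=1.0):
    """Holding cost over the hold time plus retrieval cost per expected reuse."""
    hold = option.hold_cost_per_gb_hour * size_gb * hold_hours
    retrieve = option.retrieval_cost_per_gb * size_gb * expected_reuses
    return hold + retrieve


def cheapest_option(size_gb, hold_hours, expected_reuses=1.0):
    """Return the option with the lowest total cost for this cache entry."""
    return min(
        OPTIONS,
        key=lambda o: total_cost(o, size_gb, hold_hours, expected_reuses),
    )


if __name__ == "__main__":
    # Sweep hold times from seconds to weeks for a 1 GB cache entry.
    for hours in (0.005, 1.0, 24.0, 1000.0):
        print(hours, cheapest_option(1.0, hours).name)
```

With these made-up numbers, the cheapest choice moves from hbm for a hold of a few seconds, to flash at one hour, to disk at one day, and finally to rematerialize for very long holds. Real tier boundaries depend on actual hardware prices, bandwidths, and reuse rates, none of which are modeled here.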

Related concepts

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)