KV Cache Tiering
Serving strategy for deciding whether to keep KV-cache state in fast memory, move it to slower storage, or delete and rematerialize it later.
Key points
- Pope distinguishes between rematerializing KV cache from token IDs and storing previously computed KV cache in memory [src-042].
- Rematerialization saves storage but costs compute: the retained token IDs must be run through another prefill forward pass to rebuild the cache [src-042].
- KV cache can be stored in different memory tiers such as HBM, host DDR, flash, or slower disks, each with different holding costs and retrieval costs [src-042].
- Cache pricing durations, such as short-lived versus hour-long cache writes, can hint at which memory tier a provider is using [src-042].
- The right tier balances hold time against retrieval time: short-lived caches justify faster memory, while long-lived caches may move to cheaper slower tiers [src-042].
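The tradeoff above can be sketched as a small cost model: for an expected idle time, compare the cost of holding the cache in each tier (plus the cost of retrieving it) against deleting it and paying for a fresh prefill. All tier names, prices, and numbers below are illustrative assumptions, not provider pricing from the source.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    hold_cost: float       # assumed $ per GB per second of residency
    retrieval_cost: float  # assumed $ per GB to pull the cache back for reuse

def cheapest_option(cache_gb: float, idle_secs: float,
                    tiers: list[Tier], remat_cost: float) -> str:
    """Pick the cheapest way to have the KV cache available after idle_secs.

    Compares holding the cache in each tier against deleting it and
    rematerializing it with a fresh prefill (remat_cost, $ per occurrence).
    Illustrative sketch only; real systems would also weigh retrieval latency.
    """
    best_name, best_cost = "rematerialize", remat_cost
    for t in tiers:
        cost = t.hold_cost * cache_gb * idle_secs + t.retrieval_cost * cache_gb
        if cost < best_cost:
            best_name, best_cost = t.name, cost
    return best_name

# Hypothetical tiers: faster memory costs more to hold but nothing to retrieve.
tiers = [
    Tier("HBM",   hold_cost=1e-5, retrieval_cost=0.0),
    Tier("DDR",   hold_cost=1e-6, retrieval_cost=1e-4),
    Tier("flash", hold_cost=1e-8, retrieval_cost=1e-3),
]

# Short idle window: fast memory wins despite its high holding cost.
print(cheapest_option(cache_gb=4, idle_secs=10, tiers=tiers, remat_cost=0.01))      # HBM
# Hour-long idle: cheap flash beats both fast memory and a fresh prefill.
print(cheapest_option(cache_gb=4, idle_secs=3600, tiers=tiers, remat_cost=0.01))    # flash
# Week-long idle: even flash residency costs more than rematerializing.
print(cheapest_option(cache_gb=4, idle_secs=604800, tiers=tiers, remat_cost=0.01))  # rematerialize
```

With these made-up numbers the break-even points fall out directly: as idle time grows, the optimal choice slides from fast memory to slower tiers and eventually to deleting the cache entirely, which is the hold-time-versus-retrieval-time balance the key points describe.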
Related concepts
- KV Cache
- Prefill vs Decode
- Prompt Caching for Agents
- Claude Code Token Economics
- Memory Wall for Long Context
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)