KV Cache Tiering
Serving strategy for deciding whether to keep KV-cache state in fast memory, move it to slower storage, or delete and rematerialize it later.
Key points
- Pope distinguishes between rematerializing KV cache from token IDs and storing previously computed KV cache in memory [src-042].
- Rematerialization saves storage but costs compute: the retained token IDs must be run through another prefill forward pass to rebuild the cache [src-042].
- KV cache can be stored in different memory tiers such as HBM, host DDR, flash, or slower disks, each with different holding costs and retrieval costs [src-042].
- Cache pricing durations, such as short-lived versus hour-long cache writes, can hint at which memory tier a provider is using [src-042].
- The right tier balances hold time against retrieval time: short-lived caches justify faster memory, while long-lived caches may move to cheaper slower tiers [src-042].
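The tradeoff above can be sketched as a small cost model: for an expected idle time, compare the cost of holding the cache in each tier (plus the cost of retrieving it) against deleting it and paying for a fresh prefill. All tier names, prices, and numbers below are illustrative assumptions, not provider pricing from the source.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    hold_cost: float       # assumed $ per GB per second of residency
    retrieval_cost: float  # assumed $ per GB to pull the cache back for reuse

def cheapest_option(cache_gb: float, idle_secs: float,
                    tiers: list[Tier], remat_cost: float) -> str:
    """Pick the cheapest way to have the KV cache available after idle_secs.

    Compares holding the cache in each tier against deleting it and
    rematerializing it with a fresh prefill (remat_cost, $ per occurrence).
    Illustrative sketch only; real systems would also weigh retrieval latency.
    """
    best_name, best_cost = "rematerialize", remat_cost
    for t in tiers:
        cost = t.hold_cost * cache_gb * idle_secs + t.retrieval_cost * cache_gb
        if cost < best_cost:
            best_name, best_cost = t.name, cost
    return best_name

# Hypothetical tiers: faster memory costs more to hold but nothing to retrieve.
tiers = [
    Tier("HBM",   hold_cost=1e-5, retrieval_cost=0.0),
    Tier("DDR",   hold_cost=1e-6, retrieval_cost=1e-4),
    Tier("flash", hold_cost=1e-8, retrieval_cost=1e-3),
]

# Short idle window: fast memory wins despite its high holding cost.
print(cheapest_option(cache_gb=4, idle_secs=10, tiers=tiers, remat_cost=0.01))      # HBM
# Hour-long idle: cheap flash beats both fast memory and a fresh prefill.
print(cheapest_option(cache_gb=4, idle_secs=3600, tiers=tiers, remat_cost=0.01))    # flash
# Week-long idle: even flash residency costs more than rematerializing.
print(cheapest_option(cache_gb=4, idle_secs=604800, tiers=tiers, remat_cost=0.01))  # rematerialize
```

With these made-up numbers the break-even points fall out directly: as idle time grows, the optimal choice slides from fast memory to slower tiers and eventually to deleting the cache entirely, which is the hold-time-versus-retrieval-time balance the key points describe.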
Related concepts
- KV Cache
- Prefill vs Decode
- Prompt Caching for Agents
- Claude Code Token Economics
- Memory Wall for Long Context
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)