KV Cache Tiering

A serving strategy for deciding whether to keep KV-cache state in fast memory, move it to slower storage, or delete it and rematerialize it later.

Key points

Reiner Pope distinguishes two ways to reuse a KV cache: rematerializing it from the token IDs, or keeping the previously computed cache in memory ^[src-042].
Rematerialization saves storage but requires another forward pass through the model ^[src-042].
KV cache can be stored in different memory tiers such as HBM, host DDR, flash, or slower disks, each with different holding costs and retrieval costs ^[src-042].
Cache pricing durations, such as short-lived versus hour-long cache writes, can hint at which memory tier a provider is using ^[src-042].
The right tier balances hold time against retrieval time: short-lived caches justify faster, more expensive memory, while long-lived caches may move to cheaper, slower tiers ^[src-042].
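
The trade-off in the points above can be sketched as a small cost model. This is an illustrative sketch, not a method described in the source: the `Tier` class, the `choose_plan` function, and every number in `TIERS` are hypothetical placeholders, chosen only so that faster tiers cost more to hold and less to retrieve from. It assumes the total cost of a tier is holding cost plus a retrieval-latency penalty, and it treats rematerialization as a flat recompute cost with no holding cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    hold_cost_per_gb_hour: float     # hypothetical cost of occupying this tier
    retrieval_seconds_per_gb: float  # hypothetical time to load the cache back


# Placeholder values only; real costs and bandwidths depend on the deployment.
TIERS = [
    Tier("HBM", 1.0, 0.001),
    Tier("host DDR", 0.1, 0.05),
    Tier("flash", 0.01, 0.5),
    Tier("disk", 0.001, 5.0),
]


def tier_cost(tier: Tier, size_gb: float, hold_hours: float,
              latency_cost_per_second: float) -> float:
    """Holding cost plus a penalty for the time spent retrieving the cache."""
    hold = tier.hold_cost_per_gb_hour * size_gb * hold_hours
    retrieval = tier.retrieval_seconds_per_gb * size_gb * latency_cost_per_second
    return hold + retrieval


def choose_plan(size_gb: float, hold_hours: float,
                latency_cost_per_second: float, recompute_cost: float) -> str:
    """Return the cheapest option: a storage tier or 'rematerialize'."""
    options = {
        t.name: tier_cost(t, size_gb, hold_hours, latency_cost_per_second)
        for t in TIERS
    }
    # Rematerialization stores nothing, so it pays only for the extra forward pass.
    options["rematerialize"] = recompute_cost
    return min(options, key=options.get)


if __name__ == "__main__":
    for hours in (0.01, 1.0, 100.0, 10000.0):
        print(hours, choose_plan(1.0, hours, 1.0, 10.0))
```

With these placeholder numbers, the cheapest option shifts from HBM to host DDR to flash as the hold time grows, and eventually to rematerialization once holding the cache anywhere costs more than recomputing it.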

Source references

^[src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.
