Roofline Analysis for LLM Serving
Back-of-the-envelope method for estimating LLM serving latency by comparing compute time against memory-fetch time.
Key points
- Pope models per-step decode time as lower-bounded by the maximum of compute time and memory-fetch time [src-042].
- Compute time scales with batch size times active parameters (roughly 2 FLOPs per active parameter per generated token), divided by chip FLOP/s [src-042].
- Memory time is the bytes fetched per step divided by memory bandwidth: weight bytes for all total parameters, plus KV-cache bytes equal to batch size × context length × KV bytes per token; see the sketch after this list [src-042].
- This simple model explains why serving can shift between compute-bound and memory-bound regimes as batch size or context length changes [src-042].
- The model also explains why the “Goldilocks” context length matters: past the balance point where memory time overtakes compute time, dense attention's KV-cache traffic dominates the step, and the second sketch below solves for that point [src-042].
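
A minimal sketch of the estimate, assuming roughly 2 FLOPs per active parameter per generated token and bf16 weights; the function name, default byte counts, and the 100 KB/token KV figure are illustrative assumptions, not values from the talk:

```python
def decode_step_time(
    batch_size: int,
    context_len: int,
    active_params: float,       # params doing FLOPs per token (== total for dense)
    total_params: float,        # params whose weights are fetched each step
    chip_flops: float,          # sustained FLOP/s of the chip
    mem_bandwidth: float,       # sustained memory bytes/s
    bytes_per_param: float = 2.0,       # assumes bf16/fp16 weights
    kv_bytes_per_token: float = 100e3,  # hypothetical KV-cache bytes per token
) -> float:
    """Roofline lower bound on one decode step, in seconds."""
    # Compute bound: ~2 FLOPs per active parameter per token, across the batch.
    compute_time = 2 * batch_size * active_params / chip_flops
    # Memory bound: every step streams all weights once, plus each
    # sequence's KV cache, which grows with context length.
    weight_bytes = total_params * bytes_per_param
    kv_bytes = batch_size * context_len * kv_bytes_per_token
    memory_time = (weight_bytes + kv_bytes) / mem_bandwidth
    # The step can finish no faster than the slower of the two bounds.
    return max(compute_time, memory_time)
```

For example, a hypothetical 70e9-parameter dense model on a chip with 1e15 FLOP/s and 3.3e12 bytes/s of bandwidth is memory-bound at batch size 1: streaming the weights takes about 42 ms while the compute takes about 0.14 ms, and growing the batch raises compute time until the bounds cross. Note that the model charges compute for active parameters only but memory for all fetched weight bytes, which is how the same formula covers MoE serving.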
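
A companion sketch for the balance point: setting memory time equal to compute time and solving for context length gives the longest context a batch can serve before KV-cache fetches push the step into the memory-bound regime. Same illustrative assumptions as above:

```python
def crossover_context_len(
    batch_size: int,
    active_params: float,
    total_params: float,
    chip_flops: float,
    mem_bandwidth: float,
    bytes_per_param: float = 2.0,
    kv_bytes_per_token: float = 100e3,  # hypothetical, as above
) -> float:
    """Context length (tokens) at which memory time equals compute time."""
    compute_time = 2 * batch_size * active_params / chip_flops
    weight_time = total_params * bytes_per_param / mem_bandwidth
    # Time left on the memory bus for KV fetches after streaming the weights.
    kv_time_budget = compute_time - weight_time
    if kv_time_budget <= 0:
        return 0.0  # memory-bound on weights alone, even with an empty KV cache
    # Convert the time budget into KV bytes, then into tokens per sequence.
    return kv_time_budget * mem_bandwidth / (batch_size * kv_bytes_per_token)
```

With the numbers above and a batch of 512, the crossover lands near 1,900 tokens of context; longer contexts at that batch size are memory-bound.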
Related concepts
- LLM Inference Economics
- LLM Serving Batching
- KV Cache
- Memory Wall for Long Context
- Mixture-of-Experts Serving
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)