Roofline Analysis for LLM Serving

Back-of-the-envelope method for estimating LLM serving latency by comparing compute time against memory-fetch time.

Key points

  • Pope models decode-step time as lower-bounded by the maximum of compute time and memory time [src-042].
  • Compute time scales as batch size × active parameters divided by the chip's FLOP rate [src-042].
  • Memory time is weight-fetch bytes (proportional to total parameters) plus KV-cache bytes (batch size × context length × bytes per token), all divided by memory bandwidth [src-042]; see the sketch after this list.
  • This simple model explains why serving can shift between compute-bound and memory-bound regimes as batch size or context length changes [src-042].
  • The model also explains why the “Goldilocks” context length matters: beyond the balance point, dense attention's KV-cache reads dominate the memory-bandwidth time [src-042].
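
A minimal sketch of this estimate in Python (the function name, parameter names, and all hardware and model numbers are illustrative assumptions, not figures from the source):

    def decode_step_time(
        batch_size: int,
        active_params: float,       # parameters multiplied per generated token
        total_params: float,        # parameters fetched from memory each step
        context_len: int,           # tokens of KV cache per sequence
        kv_bytes_per_token: float,  # KV-cache bytes read per context token
        param_bytes: float,         # bytes per stored parameter (2 for bf16)
        chip_flops: float,          # peak chip FLOP/s
        mem_bandwidth: float,       # peak memory bytes/s
    ) -> float:
        """Roofline lower bound (seconds) on one decode step for the whole batch."""
        # Compute time: ~2 FLOPs per active parameter per generated token.
        t_compute = 2 * batch_size * active_params / chip_flops

        # Memory time: fetch every weight once per step, plus each sequence's KV cache.
        weight_bytes = total_params * param_bytes
        kv_bytes = batch_size * context_len * kv_bytes_per_token
        t_memory = (weight_bytes + kv_bytes) / mem_bandwidth

        # Decode time is at least the slower of the two.
        return max(t_compute, t_memory)

    # Illustrative (assumed) numbers: 70B dense model in bf16, ~1e15 FLOP/s,
    # ~3e12 bytes/s of memory bandwidth, ~320 KB of KV cache per context token.
    t = decode_step_time(
        batch_size=32,
        active_params=70e9,
        total_params=70e9,
        context_len=8192,
        kv_bytes_per_token=320e3,
        param_bytes=2,
        chip_flops=1e15,
        mem_bandwidth=3e12,
    )
    print(f"decode-step lower bound: {t * 1e3:.1f} ms")  # memory-bound here: ~75 ms

With these assumed numbers the step is memory-bound (~75 ms of weight plus KV traffic against ~4.5 ms of compute). At this context length each sequence's KV bytes alone cost more bandwidth time than its compute, so raising the batch size can never make the step compute-bound; that is the balance point the “Goldilocks” bullet refers to.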

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)