Roofline Analysis for LLM Serving

Back-of-the-envelope method for estimating LLM serving latency by comparing compute time against memory-fetch time.

Key points

  • Pope models decode-step time as lower-bounded by the maximum of compute time and memory time [src-042].
  • Compute time scales as batch size × active parameters divided by the chip's FLOP rate [src-042].
  • Memory time is weight-fetch bytes (proportional to total parameters) plus KV-cache bytes (batch size × context length × bytes per token), all divided by memory bandwidth [src-042]; see the sketch after this list.
  • This simple model explains why serving can shift between compute-bound and memory-bound regimes as batch size or context length changes [src-042].
  • The model also explains why the “Goldilocks” context length matters: beyond the balance point, dense attention's KV-cache reads dominate the memory-bandwidth time [src-042].
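
A minimal sketch of this estimate in Python (the function name, parameter names, and all hardware and model numbers are illustrative assumptions, not figures from the source):

    def decode_step_time(
        batch_size: int,
        active_params: float,       # parameters multiplied per generated token
        total_params: float,        # parameters fetched from memory each step
        context_len: int,           # tokens of KV cache per sequence
        kv_bytes_per_token: float,  # KV-cache bytes read per context token
        param_bytes: float,         # bytes per stored parameter (2 for bf16)
        chip_flops: float,          # peak chip FLOP/s
        mem_bandwidth: float,       # peak memory bytes/s
    ) -> float:
        """Roofline lower bound (seconds) on one decode step for the whole batch."""
        # Compute time: ~2 FLOPs per active parameter per generated token.
        t_compute = 2 * batch_size * active_params / chip_flops

        # Memory time: fetch every weight once per step, plus each sequence's KV cache.
        weight_bytes = total_params * param_bytes
        kv_bytes = batch_size * context_len * kv_bytes_per_token
        t_memory = (weight_bytes + kv_bytes) / mem_bandwidth

        # Decode time is at least the slower of the two.
        return max(t_compute, t_memory)

    # Illustrative (assumed) numbers: 70B dense model in bf16, ~1e15 FLOP/s,
    # ~3e12 bytes/s of memory bandwidth, ~320 KB of KV cache per context token.
    t = decode_step_time(
        batch_size=32,
        active_params=70e9,
        total_params=70e9,
        context_len=8192,
        kv_bytes_per_token=320e3,
        param_bytes=2,
        chip_flops=1e15,
        mem_bandwidth=3e12,
    )
    print(f"decode-step lower bound: {t * 1e3:.1f} ms")  # memory-bound here: ~75 ms

With these assumed numbers the step is memory-bound (~75 ms of weight plus KV traffic against ~4.5 ms of compute). At this context length each sequence's KV bytes alone cost more bandwidth time than its compute, so raising the batch size can never make the step compute-bound; that is the balance point the “Goldilocks” bullet refers to.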

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)