LLM Serving Batching
Serving optimization where multiple user sequences are decoded together so the cost of reading model weights is shared across many generated tokens.
Key points
- Batching is the dominant lever behind fast-mode versus cheap-mode trade-offs in Pope’s serving model [src-042].
- At batch size one, weight-fetch cost is not amortized at all: a single token bears the full cost of reading the weights, so cost per token can be extremely high [src-042].
- As batch size grows, weight-fetch cost per token falls until per-token compute cost, which batching does not reduce, sets the floor [src-042].
- A practical batch-size balance point is roughly the hardware's FLOPs-to-memory-bandwidth ratio multiplied by the model's sparsity ratio [src-042].
- Pope uses a train-schedule analogy: the system starts a new decode batch on a fixed cadence, such as roughly every 20 milliseconds, and requests board the next batch [src-042].
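The key points above can be sketched as a small roofline-style calculation. This is a minimal illustration under stated assumptions, not the model from [src-042]: the class and function names are hypothetical, the sparsity ratio is taken to mean total parameters divided by active parameters per token, all weights are assumed to be read once per decode step, and KV-cache traffic is ignored. The hardware and model figures in the usage note are made-up round numbers.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # peak compute in FLOP/s (illustrative value)
    mem_bandwidth: float  # memory bandwidth in bytes/s (illustrative value)


@dataclass
class Model:
    total_params: float     # parameters held in memory
    active_params: float    # parameters used per token (fewer for sparse models)
    bytes_per_param: float  # storage size of one parameter


def step_time(hw: Hardware, model: Model, batch: int) -> float:
    """Estimate one decode step as the slower of weight fetch and compute."""
    # Assumption: every weight is read once per step, whatever the batch size.
    fetch = model.total_params * model.bytes_per_param / hw.mem_bandwidth
    # Assumption: about 2 FLOPs per active parameter per generated token.
    compute = 2.0 * model.active_params * batch / hw.peak_flops
    return max(fetch, compute)


def time_per_token(hw: Hardware, model: Model, batch: int) -> float:
    """Step time shared across every sequence in the batch."""
    return step_time(hw, model, batch) / batch


def balance_batch(hw: Hardware, model: Model) -> float:
    """Batch size at which weight-fetch time equals compute time."""
    intensity = hw.peak_flops / hw.mem_bandwidth         # FLOPs per byte
    sparsity = model.total_params / model.active_params  # assumed definition
    return intensity * sparsity * model.bytes_per_param / 2.0


def boarding_wait(arrival_ms: float, cadence_ms: float = 20.0) -> float:
    """Wait until the next batch departs, with departures at multiples of cadence_ms."""
    return (-arrival_ms) % cadence_ms
```

With `Hardware(peak_flops=1e15, mem_bandwidth=2e12)` and `Model(total_params=400e9, active_params=50e9, bytes_per_param=1.0)`, `balance_batch` gives 2000: below that batch size, time per token falls as the batch grows, and above it time per token stays flat at the compute floor. The factor `bytes_per_param / 2.0` comes from this sketch's own accounting and is part of why the rule of thumb is only approximate. `boarding_wait(5.0)` gives 15.0, the time a request arriving 5 ms after a departure waits for the next one.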
Related concepts
- LLM Inference Economics
- Roofline Analysis for LLM Serving
- Prefill vs Decode
- LLM Capacity Engineering
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)