LLM Serving Batching

A serving optimization in which multiple user sequences are decoded together, so that the cost of reading model weights from memory is amortized across many generated tokens.

Key points

  • Batching is the dominant lever behind fast-mode versus cheap-mode trade-offs in Pope’s serving model [src-042].
  • At batch size one, each decode step reads the full set of weights to produce a single token, so the weight-fetch cost is essentially unamortized and cost per token can be extremely high [src-042].
  • As batch size grows, weight-fetch cost per token falls until compute becomes the bottleneck and sets a floor on cost per token [src-042].
  • A practical batch-size balance point can be approximated as the hardware's FLOPs-to-memory-bandwidth ratio multiplied by the model's sparsity ratio [src-042].
  • Pope uses a train-schedule analogy: the system starts a new decode batch on a fixed cadence, such as roughly every 20 milliseconds, and requests board the next batch [src-042].
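
The key points above can be sketched as a small roofline-style cost model. This is a minimal illustration, not Pope's own formulation. It assumes that every decode step reads all weights once, that compute costs about 2 FLOPs per active parameter per token, that KV-cache traffic is ignored, and that "sparsity ratio" means total parameters divided by active parameters. The function names and hardware numbers are hypothetical.

```python
import math

def step_time_s(batch_size, active_params, total_params,
                flops_per_s, bytes_per_s, bytes_per_param=2.0):
    """Roofline estimate of one decode step (assumption: KV cache ignored)."""
    # Weight fetch: all weights are read once per step, whatever the batch size.
    fetch_s = total_params * bytes_per_param / bytes_per_s
    # Compute: roughly 2 FLOPs per active parameter for each token in the batch.
    compute_s = 2.0 * active_params * batch_size / flops_per_s
    # The slower of the two bounds the step time.
    return max(fetch_s, compute_s)

def cost_per_token_s(batch_size, **model_and_hw):
    """Accelerator-seconds spent per generated token at a given batch size."""
    return step_time_s(batch_size, **model_and_hw) / batch_size

def balance_batch_size(active_params, total_params,
                       flops_per_s, bytes_per_s, bytes_per_param=2.0):
    """Batch size where fetch time equals compute time.

    Solving fetch_s == compute_s gives
    (FLOPs / bandwidth) * (bytes_per_param / 2) * (total / active).
    """
    hardware_ratio = flops_per_s / bytes_per_s
    sparsity_ratio = total_params / active_params  # assumed definition
    return hardware_ratio * (bytes_per_param / 2.0) * sparsity_ratio

def departure_time_ms(arrival_ms, cadence_ms=20.0):
    """Train-schedule analogy: a request boards the next batch on a fixed cadence."""
    return math.ceil(arrival_ms / cadence_ms) * cadence_ms

# Hypothetical numbers, chosen only for illustration.
example = dict(
    active_params=1e10,   # parameters used per token
    total_params=4e10,    # all parameters (sparsity ratio of 4)
    flops_per_s=1.0e15,
    bytes_per_s=2.5e12,
)

if __name__ == "__main__":
    print("balance batch size:", balance_batch_size(**example))
    for b in (1, 16, 256, 1600, 3200):
        print(b, cost_per_token_s(b, **example))
    print("request at 21 ms departs at", departure_time_ms(21.0), "ms")
```

Under these assumptions, cost per token falls as 1/batch size while the step is memory-bound and then flattens once compute time exceeds fetch time. Batching past the balance point no longer lowers cost per token and only lengthens each step.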

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.