LLM Serving Batching

A serving optimization in which multiple user sequences are decoded together, so the cost of reading the model weights from memory is shared across many generated tokens.

Key points

Batching is the dominant lever behind fast-mode versus cheap-mode trade-offs in Pope’s serving model ^[src-042].
At batch size one, the weight-fetch cost is not shared with any other sequence: every decode step reads the weights to produce a single token, so cost per token can be extremely high ^[src-042].
As batch size grows, weight-fetch cost per token falls until compute becomes the floor on cost per token ^[src-042].
A practical batch-size balance point can be approximated as the hardware's FLOPs-to-memory-bandwidth ratio times the model's sparsity ratio ^[src-042].
Pope uses a train-schedule analogy: the system starts a new decode batch on a fixed cadence, such as roughly every 20 milliseconds, and requests board the next batch ^[src-042].
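
The amortization argument in the key points above can be sketched with a two-term cost model. This is an illustration under stated assumptions, not Pope's own formulation: it assumes a decode step takes the longer of reading all weights once and doing roughly 2 FLOPs per active parameter per sequence, it ignores KV-cache traffic and inter-chip communication, and it takes "sparsity ratio" to mean total parameters divided by active parameters. All names and hardware numbers are hypothetical. The `next_departure_ms` helper illustrates the train-schedule analogy, assuming a request that arrives exactly on a departure boards that batch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingSetup:
    """Hypothetical hardware and model numbers; none come from the source."""
    peak_flops: float             # accelerator throughput, FLOP/s
    mem_bandwidth: float          # memory bandwidth, bytes/s
    total_params: float           # parameters that must be read each step
    active_params: float          # parameters used per token (fewer if sparse)
    bytes_per_param: float = 2.0  # assumption: 16-bit weights


def step_time(s: ServingSetup, batch_size: int) -> float:
    """Seconds per decode step: the slower of weight fetch and compute."""
    weight_fetch = s.total_params * s.bytes_per_param / s.mem_bandwidth
    compute = batch_size * 2.0 * s.active_params / s.peak_flops
    return max(weight_fetch, compute)


def time_per_token(s: ServingSetup, batch_size: int) -> float:
    """Each step emits one token per sequence, so the step is shared."""
    return step_time(s, batch_size) / batch_size


def balance_batch_size(s: ServingSetup) -> float:
    """Batch size where weight-fetch time equals compute time."""
    hardware_ratio = s.peak_flops / s.mem_bandwidth    # FLOPs per byte
    sparsity_ratio = s.total_params / s.active_params  # assumed definition
    # The bytes_per_param / 2 factor equals 1 for 16-bit weights.
    return hardware_ratio * sparsity_ratio * s.bytes_per_param / 2.0


def next_departure_ms(arrival_ms: float, cadence_ms: float = 20.0) -> float:
    """Start time of the first fixed-cadence batch at or after arrival."""
    return math.ceil(arrival_ms / cadence_ms) * cadence_ms


setup = ServingSetup(
    peak_flops=1e15,
    mem_bandwidth=2e12,
    total_params=4e11,
    active_params=5e10,
)

for batch in (1, 64, 1024, 4000, 8000):
    print(batch, time_per_token(setup, batch))
print("balance point:", balance_batch_size(setup))
print("request at 41 ms boards at:", next_departure_ms(41.0))
```

With these made-up numbers the balance point is a batch of 4000 sequences. Below it, per-token time is dominated by weight fetch and falls in proportion to batch size; above it, per-token time flattens at the compute floor.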

Source references

^[src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

