LLM Capacity Engineering

LLM capacity engineering is the discipline of keeping AI applications reliable when model-provider rate limits, concurrency caps, retries, long loops, and tool fan-out become the production bottleneck.

Key points

  • Datadog found that rate limit errors were the most common LLM call failure in its observed customer traces [src-037].
  • In February 2026, 5 percent of LLM call spans reported an error and 60 percent of those errors were caused by exceeded rate limits; in March, rate limits still accounted for nearly a third of LLM errors [src-037].
  • Capacity problems are amplified by shared organization quotas, concurrency bursts, retry spikes, ReAct-style variable loops, and multi-agent collaboration [src-037].
  • Long-lived loops can hit provider rate limits or organizational caps, trigger retries, increase load, and turn a local capacity issue into a sustained system failure [src-037].
  • Datadog recommends queueing, backoff, fallback capacity, budget limits, and prompt/application design that avoids unnecessary loop length and tool fan-out [src-037].
  • Agent budgets should cap calls or tokens so loops terminate before runaway activity exhausts capacity or damages downstream services [src-037].
  • Pope’s serving model adds the provider-side capacity layer: practical throughput depends on batch cadence, memory bandwidth, active parameters, KV-cache size, and whether enough demand exists to fill efficient batches [src-042].
  • Inference capacity is not just “more GPUs”; scale-up domain size, all-to-all communication, and memory bandwidth determine which model shapes can be served at acceptable latency [src-042].
  • Next ’26 adds cloud-platform capacity primitives for agent workloads: TPU 8t/8i, Virgo Network, Managed Lustre, Rapid Storage, network-optimized compute, faster GKE inference scale-out, and agent sandboxes at 300 per second per cluster [src-044].
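The backoff part of the mitigation list above can be sketched in a few lines. This is a minimal illustration, not any provider's SDK: `RateLimitError` stands in for a provider's 429 exception, and full jitter is used so that many clients hitting the same limit do not retry in lockstep and turn one rate-limit event into a sustained retry spike.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider's 429 / rate-limit exception (hypothetical name)."""


def call_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry an LLM call with capped exponential backoff and full jitter.

    Jitter desynchronizes clients sharing an organization quota, which is
    exactly the amplification scenario described in the key points above.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget of retries exhausted; surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter: 0..delay
```

A queue in front of this function (so concurrent callers drain at a bounded rate) addresses the concurrency-burst amplifier; backoff alone only handles the retry-spike amplifier.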
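The agent-budget point can likewise be made concrete. The sketch below is illustrative and framework-agnostic; the class and method names are assumptions, not a specific agent library's API. The idea is simply that every model call is charged against hard call and token caps, so a ReAct-style loop terminates before runaway activity exhausts shared capacity.

```python
class BudgetExceeded(Exception):
    """Raised when an agent run exhausts its call or token budget."""


class AgentBudget:
    """Hard per-run caps on LLM calls and tokens (illustrative sketch)."""

    def __init__(self, max_calls=20, max_tokens=50_000):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0

    def charge(self, tokens):
        """Record one model call; raise once either cap is crossed."""
        self.calls += 1
        self.tokens += tokens
        if self.calls > self.max_calls or self.tokens > self.max_tokens:
            raise BudgetExceeded(
                f"budget hit after {self.calls} calls / {self.tokens} tokens"
            )
```

In a loop, `budget.charge(usage.total_tokens)` after each call turns an open-ended loop into one with a guaranteed exit, converting a potential sustained system failure into a bounded, reportable error.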
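The provider-side point about memory bandwidth, active parameters, and KV-cache size can be illustrated with back-of-envelope arithmetic. This is a rough bandwidth-only ceiling for decode, under the simplifying assumption that each decode step streams the active weights once (amortized across the batch) plus each sequence's KV cache; real serving also depends on compute, interconnect, and batch cadence [src-042]. All parameter names here are assumptions for illustration.

```python
def decode_tokens_per_sec(mem_bw_gbps, active_params_b, bytes_per_param=2,
                          kv_cache_gb=0.0, batch=1):
    """Bandwidth-bound decode ceiling (tokens/sec), ignoring compute limits.

    mem_bw_gbps     -- accelerator memory bandwidth in GB/s
    active_params_b -- active parameters per token, in billions
    kv_cache_gb     -- KV cache read per sequence per step, in GB
    batch           -- sequences decoded together (weights are shared)
    """
    weight_bytes = active_params_b * 1e9 * bytes_per_param  # read once per step
    kv_bytes = kv_cache_gb * 1e9                            # read per sequence
    bytes_per_step = weight_bytes + batch * kv_bytes
    steps_per_sec = mem_bw_gbps * 1e9 / bytes_per_step
    return steps_per_sec * batch  # each step emits one token per sequence
```

The batch term shows why "enough demand to fill efficient batches" matters: at batch 1 the weights dominate the traffic, so throughput scales nearly linearly with batch size until KV-cache reads take over.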

Source references

  • [src-037] Datadog — “State of AI Engineering” (2026-04-21)
  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)
  • [src-044] Thomas Kurian — “Welcome to Google Cloud Next ’26” (2026-04-22)