LLM Parallelism Strategies
Ways to split large-model training or serving across hardware along axes such as experts, layers (pipeline stages), tensors, and data batches.
Key points
- Pope emphasizes that useful parallelism often follows the model’s own axes: experts can be split across GPUs, layers across racks, and data across replicas [src-042].
- Expert parallelism is a strong fit for sparse MoE serving because different experts can live on different GPUs inside a scale-up domain [src-042].
- Pipeline parallelism splits layers across racks and can reduce weight memory per rack, but adds complexity and can create bubbles in training [src-042].
- In inference, pipelining is mostly neutral for latency and helps weight capacity more than KV-cache capacity, because more pipeline stages also require more in-flight micro-batches [src-042].
- Tensor parallelism is less attractive when experts are small, because there is less benefit in slicing inside a single expert [src-042].
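The pipeline points above can be made concrete with some rough arithmetic. The sketch below is illustrative only and is not taken from [src-042]: the function names, the example sizes, and the assumption that keeping the pipeline full takes one in-flight micro-batch per stage are all hypothetical. The bubble estimate uses the standard GPipe-style formula (stages - 1) / (micro-batches + stages - 1), which is one common schedule and not necessarily the one Pope has in mind.

```python
def per_stage_memory(weight_gb, kv_gb_per_microbatch, stages):
    """Return (weights_gb, kv_gb) held by a single pipeline stage.

    Assumptions (hypothetical, for illustration only):
    - layers, and therefore weights, are split evenly across stages;
    - one micro-batch per stage is in flight to keep every stage busy;
    - each stage stores KV cache only for its own share of the layers.
    """
    in_flight = stages
    weights = weight_gb / stages
    kv = (kv_gb_per_microbatch / stages) * in_flight
    return weights, kv


def bubble_fraction(stages, microbatches):
    """Idle fraction of a GPipe-style training step (standard estimate)."""
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    # Example sizes are made up: 800 GB of weights, 40 GB of KV cache per micro-batch.
    for stages in (1, 2, 4, 8):
        weights, kv = per_stage_memory(800.0, 40.0, stages)
        idle = bubble_fraction(stages, microbatches=12)
        print(f"stages={stages} weights/stage={weights:.0f} GB "
              f"kv/stage={kv:.0f} GB training bubble={idle:.2f}")
```

Under these assumptions the weight footprint per stage falls as 1/stages while the KV-cache footprint per stage stays flat, which is why pipelining helps weight capacity more than KV-cache capacity. The bubble fraction shrinks only as the number of micro-batches grows relative to the number of stages.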
Related concepts
- Mixture-of-Experts Serving
- Scale-Up vs Scale-Out Networking
- LLM Inference Economics
- Training-Inference Compute Balance
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)