Mixture-of-Experts Serving
Serving architecture for sparse models in which a router sends each token to a small subset of expert MLPs. Sparsity cuts the compute active per token, but it raises total parameter count (and thus memory capacity needs) and adds communication complexity.
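The routing mechanism described above can be sketched minimally: a learned router scores all experts per token, the top-k experts run their MLPs, and their outputs are combined weighted by the router scores. This is an illustrative NumPy sketch, not any particular model's implementation; shapes, the ReLU nonlinearity, and `k=2` are assumptions for clarity.

```python
import numpy as np

def moe_layer(x, router_w, expert_w1, expert_w2, k=2):
    """Minimal top-k MoE layer: each token is routed to k of E expert MLPs.

    x:         (tokens, d_model) activations
    router_w:  (d_model, E) router projection
    expert_w1: (E, d_model, d_ff) first MLP weight per expert
    expert_w2: (E, d_ff, d_model) second MLP weight per expert
    """
    logits = x @ router_w                         # (tokens, E) router scores
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)         # softmax over experts
    topk = np.argsort(-probs, axis=-1)[:, :k]     # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:                         # only k experts do any work
            h = np.maximum(x[t] @ expert_w1[e], 0.0)    # expert MLP (ReLU)
            out[t] += probs[t, e] * (h @ expert_w2[e])  # gate-weighted sum
    return out, topk
```

Only `k` of the `E` expert weight matrices touch each token, which is the source of the compute savings; all `E` still have to be resident in memory.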
Key points
- Pope uses DeepSeek-style sparse MoE as the running example: many total parameters but only a subset active for each generated token [src-042].
- Higher sparsity (fewer active experts relative to the total) reduces per-token compute, but it increases total parameter count and therefore memory-capacity requirements [src-042].
- Expert parallelism maps different experts to different GPUs, making the traffic pattern all-to-all across the scale-up domain [src-042].
- A single rack with full all-to-all connectivity is a natural fit for MoE serving; crossing rack boundaries introduces slower scale-out links [src-042].
- Smaller experts leave less matrix dimension to shard, reducing the usefulness of tensor parallelism; expert parallelism plus limited pipeline parallelism become the main serving strategies [src-042].
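The points above can be made concrete with back-of-envelope arithmetic for an expert-parallel deployment. All figures here are illustrative assumptions (a roughly DeepSeek-like configuration, a hypothetical 32-GPU rack), not measurements from the source: they show why so few experts are active per token yet every token's hidden state must cross the scale-up fabric twice per MoE layer (dispatch to its experts, then gather back).

```python
# Illustrative expert-parallelism arithmetic; all config values are assumptions.
n_experts = 256   # routed experts per MoE layer (assumed)
k_active  = 8     # experts activated per token (assumed)
n_gpus    = 32    # GPUs in the scale-up domain, e.g. one rack (assumed)
d_model   = 7168  # hidden size (assumed)
bytes_elt = 2     # bf16 activations

experts_per_gpu = n_experts // n_gpus     # experts resident on each GPU
active_frac = k_active / n_experts        # fraction of experts a token uses

# Each token's hidden vector is dispatched to the GPUs holding its k experts
# and the results are gathered back: ~2 * k * d_model elements per layer.
bytes_per_token_layer = 2 * k_active * d_model * bytes_elt

print(f"{active_frac:.1%} of experts active per token")
print(f"{experts_per_gpu} experts per GPU")
print(f"{bytes_per_token_layer / 1024:.0f} KiB all-to-all per token per MoE layer")
```

Because a token's k experts can land on any of the GPUs, this dispatch/gather traffic is all-to-all, which is why a fully connected scale-up domain fits MoE serving and slower cross-rack links hurt it.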
Related concepts
- LLM Parallelism Strategies
- Scale-Up vs Scale-Out Networking
- LLM Inference Economics
- Roofline Analysis for LLM Serving
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)