Mixture-of-Experts Serving

Serving architecture for sparse models where a router sends each token to a subset of expert MLPs, reducing active compute while increasing total parameters and communication complexity.

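As a minimal sketch of the routing step: the code below assumes a common top-k gating scheme (a softmax over the selected experts' router logits). The talk does not pin down DeepSeek's exact gating function, and names such as `route_tokens` and `router_w` are illustrative.

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=2):
    """Score every token against every expert, keep the top_k experts.

    hidden:   (num_tokens, d_model) token activations
    router_w: (d_model, num_experts) router projection
    Returns (expert_ids, gate_weights), each of shape (num_tokens, top_k).
    """
    logits = hidden @ router_w                            # (num_tokens, num_experts)
    expert_ids = np.argsort(-logits, axis=-1)[:, :top_k]  # top_k experts per token
    top_logits = np.take_along_axis(logits, expert_ids, axis=-1)
    top_logits -= top_logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    gate_weights = np.exp(top_logits)
    gate_weights /= gate_weights.sum(axis=-1, keepdims=True)
    return expert_ids, gate_weights

def moe_layer(hidden, router_w, experts, top_k=2):
    """Run only the selected experts per token and mix outputs by gate weight."""
    expert_ids, gates = route_tokens(hidden, router_w, top_k)
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        for k in range(top_k):
            out[t] += gates[t, k] * experts[expert_ids[t, k]](hidden[t])
    return out
```
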
Key points

  • Pope uses DeepSeek-style sparse MoE as the running example: a very large total parameter count, with only a small subset of experts active for each generated token [src-042].
  • Higher sparsity reduces active compute per token, but increases total parameter count and therefore memory capacity requirements (see the parameter-count sketch after this list) [src-042].
  • Expert parallelism maps different experts to different GPUs, making the traffic pattern all-to-all across the scale-up domain (see the dispatch sketch below) [src-042].
  • A single rack with full all-to-all connectivity is a natural fit for MoE serving; crossing rack boundaries forces traffic onto slower scale-out links [src-042].
  • Because each expert is small, there is little matrix work to shard within a single expert, so tensor parallelism loses value and expert parallelism plus limited pipeline parallelism become the main serving strategies [src-042].
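
To make the compute/memory trade-off concrete, here is a back-of-the-envelope parameter count. The figures are assumptions loosely modeled on publicly reported DeepSeek-V3 numbers, not taken from the talk:

```python
# Illustrative sparse-MoE parameter count (all numbers are assumptions,
# loosely based on publicly reported DeepSeek-V3 figures, not the talk).
num_experts       = 256     # routed experts per MoE layer
experts_per_token = 8       # top-k experts activated per token
params_per_expert = 2.3e9   # parameters in one expert MLP, summed over layers
shared_params     = 18e9    # attention, embeddings, shared/dense parts

total_params  = shared_params + num_experts * params_per_expert
active_params = shared_params + experts_per_token * params_per_expert

print(f"total:  {total_params / 1e9:.0f}B params (must fit in HBM)")   # ~607B
print(f"active: {active_params / 1e9:.0f}B params (compute per token)") # ~36B
# Roughly 17x fewer FLOPs per token than a dense model of the same size.
```

Compute per token tracks the active count, while HBM capacity must hold the full total, which is why higher sparsity buys FLOPs at the cost of memory footprint.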

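As a simplified model of why expert parallelism implies all-to-all traffic, the sketch below counts how many token activations each device must ship to each other device; `dispatch_plan`, the even token/expert sharding, and the sizes are all hypothetical:

```python
import numpy as np

def dispatch_plan(expert_ids, num_devices, num_experts):
    """Build the all-to-all send matrix for expert parallelism.

    Expert e is assumed to live on device e // experts_per_device, and
    tokens are assumed to be sharded evenly across devices (real systems
    also batch, pad, and load-balance this traffic). Entry [s, d] counts
    the (token, expert) assignments device s must ship to device d.
    """
    experts_per_device = num_experts // num_devices
    num_tokens = expert_ids.shape[0]
    token_device = np.arange(num_tokens) * num_devices // num_tokens
    sends = np.zeros((num_devices, num_devices), dtype=int)
    for t in range(num_tokens):
        for e in expert_ids[t]:
            sends[token_device[t], e // experts_per_device] += 1
    return sends

rng = np.random.default_rng(0)
ids = rng.integers(0, 256, size=(1024, 8))  # 1024 tokens, top-8 of 256 experts
plan = dispatch_plan(ids, num_devices=32, num_experts=256)
# With a roughly uniform router, the matrix is dense and near-uniform:
# essentially every device sends to every other device (an all-to-all),
# which is why a fully connected scale-up domain (one rack) fits so well.
print(plan.min(), plan.max())  # small spread around the mean of 8 sends per pair
```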

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)