Mixture-of-Experts Serving
Serving architecture for sparse models in which a router sends each token to a small subset of expert MLPs. Sparsity cuts the compute active per token, but it raises total parameter count (and thus memory capacity needs) and adds communication complexity.
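The routing mechanism described above can be sketched minimally: a learned router scores all experts per token, the top-k experts run their MLPs, and their outputs are combined weighted by the router scores. This is an illustrative NumPy sketch, not any particular model's implementation; shapes, the ReLU nonlinearity, and `k=2` are assumptions for clarity.

```python
import numpy as np

def moe_layer(x, router_w, expert_w1, expert_w2, k=2):
    """Minimal top-k MoE layer: each token is routed to k of E expert MLPs.

    x:         (tokens, d_model) activations
    router_w:  (d_model, E) router projection
    expert_w1: (E, d_model, d_ff) first MLP weight per expert
    expert_w2: (E, d_ff, d_model) second MLP weight per expert
    """
    logits = x @ router_w                         # (tokens, E) router scores
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)         # softmax over experts
    topk = np.argsort(-probs, axis=-1)[:, :k]     # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:                         # only k experts do any work
            h = np.maximum(x[t] @ expert_w1[e], 0.0)    # expert MLP (ReLU)
            out[t] += probs[t, e] * (h @ expert_w2[e])  # gate-weighted sum
    return out, topk
```

Only `k` of the `E` expert weight matrices touch each token, which is the source of the compute savings; all `E` still have to be resident in memory.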
Key points
- Pope uses DeepSeek-style sparse MoE as the running example: many total parameters but only a subset active for each generated token [src-042].
- Higher sparsity (fewer active experts relative to the total) reduces per-token compute, but it increases total parameter count and therefore memory-capacity requirements [src-042].
- Expert parallelism maps different experts to different GPUs, making the traffic pattern all-to-all across the scale-up domain [src-042].
- A single rack with full all-to-all connectivity is a natural fit for MoE serving; crossing rack boundaries introduces slower scale-out links [src-042].
- Smaller experts leave less matrix dimension to shard, reducing the usefulness of tensor parallelism; expert parallelism plus limited pipeline parallelism become the main serving strategies [src-042].
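The points above can be made concrete with back-of-envelope arithmetic for an expert-parallel deployment. All figures here are illustrative assumptions (a roughly DeepSeek-like configuration, a hypothetical 32-GPU rack), not measurements from the source: they show why so few experts are active per token yet every token's hidden state must cross the scale-up fabric twice per MoE layer (dispatch to its experts, then gather back).

```python
# Illustrative expert-parallelism arithmetic; all config values are assumptions.
n_experts = 256   # routed experts per MoE layer (assumed)
k_active  = 8     # experts activated per token (assumed)
n_gpus    = 32    # GPUs in the scale-up domain, e.g. one rack (assumed)
d_model   = 7168  # hidden size (assumed)
bytes_elt = 2     # bf16 activations

experts_per_gpu = n_experts // n_gpus     # experts resident on each GPU
active_frac = k_active / n_experts        # fraction of experts a token uses

# Each token's hidden vector is dispatched to the GPUs holding its k experts
# and the results are gathered back: ~2 * k * d_model elements per layer.
bytes_per_token_layer = 2 * k_active * d_model * bytes_elt

print(f"{active_frac:.1%} of experts active per token")
print(f"{experts_per_gpu} experts per GPU")
print(f"{bytes_per_token_layer / 1024:.0f} KiB all-to-all per token per MoE layer")
```

Because a token's k experts can land on any of the GPUs, this dispatch/gather traffic is all-to-all, which is why a fully connected scale-up domain fits MoE serving and slower cross-rack links hurt it.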
Related concepts
- LLM Parallelism Strategies
- Scale-Up vs Scale-Out Networking
- LLM Inference Economics
- Roofline Analysis for LLM Serving
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)