Scale-Up vs Scale-Out Networking
Distinction between fast intra-rack accelerator communication and slower inter-rack or data-center communication in AI clusters.
Key points
- In Pope’s explanation, scale-up networking connects GPUs inside a rack with high-bandwidth all-to-all connectivity, while scale-out networking connects racks through slower data-center fabrics [src-042].
- Mixture-of-Experts (MoE) all-to-all traffic is well matched to scale-up networks because any GPU may need to send tokens to experts on any other GPU [src-042].
- Crossing rack boundaries can bottleneck MoE traffic because a large share of tokens may have to traverse the slower scale-out links (see the traffic sketch after this list) [src-042].
- Larger scale-up domains matter not only for capacity but also for effective memory bandwidth: more GPUs can read model weights in parallel during decode (see the bandwidth sketch below) [src-042].
- Physical rack constraints such as cabling density, bend radius, power, cooling, weight, and backplane design limit how large scale-up domains can become [src-042].
- [src-061] adds a training-scale reliability angle: once runs involve 10,000 to 100,000 GPUs, component failures are expected, and cluster software must treat redundancy and failure handling as normal operating conditions (see the failure-rate sketch below).
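To make the rack-boundary point concrete, here is a minimal back-of-the-envelope sketch of one MoE all-to-all dispatch under a two-tier network. The cluster shape, token count, hidden size, and bandwidth figures are illustrative assumptions, not numbers from [src-042].

```python
# Rough model of one MoE all-to-all: a token routed to a uniformly random
# expert lands in the sender's rack with probability gpus_per_rack /
# gpus_total; the remainder must cross the slower scale-out fabric.

def moe_all_to_all_time(
    tokens: int,          # tokens dispatched in one MoE layer
    hidden_dim: int,      # model hidden size
    bytes_per_elem: int,  # e.g. 2 for bf16 activations
    gpus_total: int,      # GPUs the experts are sharded across
    gpus_per_rack: int,   # size of the scale-up domain
    bw_scale_up: float,   # per-GPU scale-up bandwidth, bytes/s
    bw_scale_out: float,  # per-GPU scale-out bandwidth, bytes/s
) -> tuple[float, float]:
    """Return (intra-rack seconds, inter-rack seconds) for the dispatch."""
    per_gpu_bytes = tokens * hidden_dim * bytes_per_elem / gpus_total
    frac_intra = gpus_per_rack / gpus_total
    t_intra = per_gpu_bytes * frac_intra / bw_scale_up
    t_inter = per_gpu_bytes * (1.0 - frac_intra) / bw_scale_out
    return t_intra, t_inter

# Illustrative numbers: 8 racks of 72 GPUs, 10x bandwidth gap between tiers.
t_in, t_out = moe_all_to_all_time(
    tokens=65_536, hidden_dim=8_192, bytes_per_elem=2,
    gpus_total=576, gpus_per_rack=72,
    bw_scale_up=900e9, bw_scale_out=90e9,
)
print(f"intra-rack: {t_in * 1e6:.1f} us, inter-rack: {t_out * 1e6:.1f} us")
```

With these assumed numbers, 7/8 of the bytes leave the rack, and the inter-rack leg takes roughly 18 µs versus under 1 µs intra-rack: the bandwidth gap, not the traffic split, dominates the dispatch time.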
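The effective-memory-bandwidth point is simple arithmetic: during decode, every GPU in the scale-up domain streams its shard of the weights from HBM in parallel, so the per-token latency floor shrinks with domain size. A sketch with assumed figures (a 500B-parameter bf16 model and 3.35 TB/s of HBM bandwidth per GPU, neither taken from the source):

```python
# Lower bound on per-token decode latency: every decode step must read
# all model weights from HBM once, and that read is split across the
# GPUs in the scale-up domain.

def decode_step_floor(weight_bytes: float, gpus: int, hbm_bw: float) -> float:
    """Seconds to stream the full weights once at aggregate bandwidth."""
    return weight_bytes / (gpus * hbm_bw)

weight_bytes = 500e9 * 2  # 500B parameters at 2 bytes each (bf16)
for gpus in (8, 72, 144):
    ms = decode_step_floor(weight_bytes, gpus, hbm_bw=3.35e12) * 1e3
    print(f"{gpus:>3} GPUs: >= {ms:.2f} ms per decoded token")
```

Going from 8 to 72 GPUs cuts the floor from roughly 37 ms to about 4 ms per token, which is why scale-up domain size shows up directly in decode latency and not just in total capacity.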
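The reliability point from [src-061] also comes down to arithmetic: expected failures scale linearly with GPU count. A sketch assuming independent failures and an illustrative per-GPU mean time between failures of 50,000 hours (an assumption, not a figure from the source):

```python
import math

def p_any_failure(gpus: int, hours: float, mtbf_hours: float) -> float:
    """Probability that at least one GPU fails during the window,
    modeling failures as independent exponentials with the given MTBF."""
    return 1.0 - math.exp(-gpus * hours / mtbf_hours)

MTBF = 50_000.0  # assumed per-GPU mean time between failures, hours
for gpus in (10_000, 100_000):
    p = p_any_failure(gpus, hours=24.0, mtbf_hours=MTBF)
    expected = gpus * 24.0 / MTBF
    print(f"{gpus:>7} GPUs: expected failures/day = {expected:4.1f}, "
          f"P(at least one) = {p:.2f}")
```

At 100,000 GPUs this model predicts dozens of failures per day, which is why cluster software has to treat redundancy as routine rather than exceptional.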
Related concepts
- Mixture-of-Experts Serving
- LLM Parallelism Strategies
- LLM Inference Economics
- GPU Supply as AI Strategy
Source references
- [src-042] Dwarkesh Patel, “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)
- [src-061] Lex Fridman, “State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490” (2026-01-31)