Memory Wall for Long Context
Infrastructure constraint where extending LLM context length is limited more by KV-cache memory bandwidth and capacity than by raw compute.
Key points
- Pope argues that, at practical context lengths, very long context is constrained primarily by memory bandwidth and memory capacity rather than by the quadratic attention-compute term [src-042].
- With dense attention, the KV-cache bytes fetched per decoded token grow roughly linearly with context length [src-042].
- Sparse attention improves that scaling, but going too sparse eventually harms quality because the model attends to too small a subset of prior tokens [src-042]; the first sketch after this list contrasts the two fetch costs.
- The plateau of commercial context windows around 100K-200K tokens is read as a sign that providers sit near a balanced cost point, while million- or hundred-million-token contexts remain cost-prohibitive without a major memory-side change [src-042]; the second sketch after this list puts rough numbers on the capacity side.
- This hardware limit matters for agentic systems because using in-context learning as long-term working memory would require much longer contexts than today's economically balanced range [src-042].
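The sketch below makes the decode-time fetch cost concrete: it estimates the KV-cache bytes that dense attention must stream from memory for each generated token, and contrasts that with a hypothetical sparse scheme that only reads a fixed budget of prior tokens. The model shape (80 layers, 8 KV heads, head dim 128, fp16 KV) and the 4K sparse budget are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope: KV-cache bytes read from memory per decoded token,
# dense attention vs. a fixed-budget sparse variant.
# Model shape and dtype are illustrative assumptions, not figures from the source.

def kv_bytes_per_token(context_len, n_layers=80, n_kv_heads=8,
                       head_dim=128, dtype_bytes=2):
    """Bytes of K and V that dense attention streams from HBM to emit one token."""
    per_token_kv = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # K and V entries
    return context_len * per_token_kv

def sparse_kv_bytes_per_token(context_len, budget=4_096, **kwargs):
    """Same cost if attention only touches a fixed budget of prior tokens."""
    return kv_bytes_per_token(min(context_len, budget), **kwargs)

for ctx in (8_000, 128_000, 1_000_000):
    dense = kv_bytes_per_token(ctx)
    sparse = sparse_kv_bytes_per_token(ctx)
    print(f"{ctx:>9,} tokens: dense {dense / 1e9:6.2f} GB/token, "
          f"sparse (4K budget) {sparse / 1e9:.2f} GB/token")
```

Under these assumptions the dense fetch cost grows linearly (roughly 2.6 GB, 42 GB, and 328 GB per decoded token), while the sparse budget caps it at about 1.3 GB, which is the scaling-versus-quality trade-off the bullets describe.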
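A companion sketch on the capacity side: total KV-cache size per sequence compared with the HBM of a single 80 GB accelerator, under the same assumed model shape. Only the scaling is the point; none of the specific numbers come from the talk.

```python
# Back-of-envelope: total KV-cache capacity per sequence vs. one accelerator's HBM.
# The model shape, fp16 KV, and the 80 GB HBM figure are illustrative assumptions.

N_LAYERS, N_KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
PER_TOKEN_KV = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES  # bytes for K and V

HBM_BYTES = 80e9  # a single 80 GB device

for ctx in (128_000, 1_000_000, 100_000_000):
    cache = ctx * PER_TOKEN_KV
    print(f"{ctx:>11,} tokens -> {cache / 1e9:8.1f} GB of KV cache "
          f"(~{cache / HBM_BYTES:.1f}x one 80 GB device)")
```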
Related concepts
- KV Cache
- Claude Code Context Management Discipline
- Context Quality Engineering
- LLM Inference Economics
- Prefill vs Decode
Source references
- [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)