Memory Wall for Long Context

Infrastructure constraint where extending LLM context length is limited more by KV-cache memory bandwidth and capacity than by raw compute.

Key points

  • Pope argues that, at practical context lengths, very long context is constrained primarily by memory bandwidth and memory capacity, not by the quadratic compute term [src-042].
  • Dense attention makes KV-cache fetch cost grow roughly linearly with context length during decode [src-042].
  • Sparse attention can improve the scaling, but going too sparse eventually harms quality because the model attends to too small a subset of prior tokens [src-042].
  • The plateau of context windows around 100K–200K tokens is interpreted as a sign that providers are near a balanced cost point, while million- or hundred-million-token contexts remain cost-prohibitive without a major memory-side change [src-042].
  • This hardware limit matters for agentic systems because using in-context learning as long-term working memory would require much longer contexts than today’s economically balanced range [src-042].
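The linear-scaling claim above can be made concrete with a back-of-the-envelope estimate: during decode, each generated token must stream the full KV cache from HBM, so per-token latency grows linearly with context length. The model shape (80 layers, grouped-query attention with 8 KV heads, head dimension 128, fp16) and the HBM bandwidth figure (3.35 TB/s, roughly an H100-class accelerator) are illustrative assumptions, not numbers from the source.

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    """KV-cache size in bytes for a dense-attention decoder.

    Assumed shape: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per cached token, stored in fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len


def decode_step_seconds(context_len, hbm_bw=3.35e12, **shape):
    """Lower bound on per-token decode latency if the whole KV cache
    must be read from HBM once per generated token (dense attention)."""
    return kv_cache_bytes(context_len, **shape) / hbm_bw


# Doubling the context roughly doubles both the cache footprint and
# the memory traffic per decoded token -- the "memory wall".
for ctx in (8_000, 128_000, 1_000_000):
    gb = kv_cache_bytes(ctx) / 1e9
    ms = decode_step_seconds(ctx) * 1e3
    print(f"{ctx:>9,} tokens: KV cache {gb:7.1f} GB, "
          f">= {ms:6.2f} ms/token from KV fetch alone")
```

Under these assumed numbers, a 128K-token cache is already tens of gigabytes and costs roughly ten milliseconds of pure HBM traffic per decoded token, while a million-token cache exceeds a single accelerator's memory entirely, which is why the bottleneck is memory-side rather than compute-side.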

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)