Memory Wall for Long Context

Infrastructure constraint where extending LLM context length is limited more by KV-cache memory bandwidth and capacity than by raw compute.

Key points

  • Pope argues that, at practical context lengths, very long context is constrained primarily by memory bandwidth and memory capacity, not by the quadratic compute term [src-042].
  • Dense attention makes KV-cache fetch cost grow roughly linearly with context length during decode [src-042].
  • Sparse attention can improve the scaling, but going too sparse eventually harms quality because the model attends to too small a subset of prior tokens [src-042].
  • The plateau of context windows around 100K–200K tokens is interpreted as a sign that providers are near a balanced cost point, while million- or hundred-million-token contexts remain cost-prohibitive without a major memory-side change [src-042].
  • This hardware limit matters for agentic systems because using in-context learning as long-term working memory would require much longer contexts than today’s economically balanced range [src-042].
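The linear-scaling claim above can be made concrete with a back-of-the-envelope estimate: during decode, each generated token must stream the full KV cache from HBM, so per-token latency grows linearly with context length. The model shape (80 layers, grouped-query attention with 8 KV heads, head dimension 128, fp16) and the HBM bandwidth figure (3.35 TB/s, roughly an H100-class accelerator) are illustrative assumptions, not numbers from the source.

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    """KV-cache size in bytes for a dense-attention decoder.

    Assumed shape: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per cached token, stored in fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len


def decode_step_seconds(context_len, hbm_bw=3.35e12, **shape):
    """Lower bound on per-token decode latency if the whole KV cache
    must be read from HBM once per generated token (dense attention)."""
    return kv_cache_bytes(context_len, **shape) / hbm_bw


# Doubling the context roughly doubles both the cache footprint and
# the memory traffic per decoded token -- the "memory wall".
for ctx in (8_000, 128_000, 1_000_000):
    gb = kv_cache_bytes(ctx) / 1e9
    ms = decode_step_seconds(ctx) * 1e3
    print(f"{ctx:>9,} tokens: KV cache {gb:7.1f} GB, "
          f">= {ms:6.2f} ms/token from KV fetch alone")
```

Under these assumed numbers, a 128K-token cache is already tens of gigabytes and costs roughly ten milliseconds of pure HBM traffic per decoded token, while a million-token cache exceeds a single accelerator's memory entirely, which is why the bottleneck is memory-side rather than compute-side.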

Source references

  • [src-042] Dwarkesh Patel — “How GPT, Claude, and Gemini are actually trained and served – Reiner Pope” (2026-04-29)