Real-World AI Task Horizons

Real-world AI task horizons measure how AI success rates decline as user-chosen tasks require more human time, capturing effective capability in deployed usage rather than only controlled benchmarks.

Key points

  • Anthropic relates its task-success primitive to task-horizon work such as METR’s measurements of the longest tasks an AI can reliably complete [src-069, src-070].
  • In first-party API data, success rates fall from around 60% for sub-hour tasks to roughly 45% for tasks estimated at 5+ human hours [src-069].
  • The fitted line for API usage reaches 50% success at about 3.5 human hours, while Claude.ai extrapolates to about 19 hours, because multi-turn conversations let users decompose tasks and correct intermediate work [src-069, src-070].
  • Real-world task horizons mix model capability with user selection, setup cost, and user judgment about what is worth bringing to Claude [src-069, src-070].
  • Controlled benchmarks measure autonomous frontier capability; real-world usage measures effective task horizon across broader, user-selected work [src-069, src-070].
  • Anthropic’s scientific-computing case is a concrete long-horizon example: Claude Code worked over several days on a specialized numerical solver, using persistent memory, test oracles, Git coordination, and occasional steering [src-072].
  • The case distinguishes long-horizon work that can be autonomously pursued because progress is measurable from open-ended scientific discovery where human judgment remains central [src-072].
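The 50%-success horizon in the figures above can be illustrated with a minimal logistic fit. This is a sketch, not Anthropic's actual estimation procedure: it assumes just two synthetic anchor points taken from the rounded figures (~60% success at an assumed 0.5 human hours, ~45% at 5 hours) and solves for the crossing point in closed form, so it lands in the same low-single-digit-hours range as the reported ~3.5 h rather than reproducing it exactly.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a success probability p."""
    return math.log(p / (1.0 - p))

# Synthetic anchors from the report's rounded figures (the 0.5 h midpoint
# for "sub-hour tasks" is an assumption made for this sketch):
t1, p1 = 0.5, 0.60   # ~60% success on sub-hour tasks
t2, p2 = 5.0, 0.45   # ~45% success on tasks estimated at 5+ human hours

# Model: logit(success) = c0 + c1 * ln(hours).
# With two anchors, the 2x2 linear system has an exact solution.
c1 = (logit(p2) - logit(p1)) / (math.log(t2) - math.log(t1))
c0 = logit(p1) - c1 * math.log(t1)

# Task horizon: human-hours where predicted success crosses 50% (logit = 0).
t50 = math.exp(-c0 / c1)
print(f"50% horizon: {t50:.1f} human hours")  # ~2.3 h with these anchors
```

A real analysis would fit the full per-task success data rather than two aggregate points, which is why the report's ~3.5 h API figure differs from this toy estimate.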

Source references

  • [src-069] Anthropic – “Anthropic Economic Index report: Economic primitives” (2026-01-15)
  • [src-070] Anthropic – “Anthropic Economic Index: New building blocks for understanding AI use” (2026-01-15)
  • [src-072] Siddharth Mishra-Sharma – “Long-running Claude for scientific computing” (2026-03-23)