Real-World AI Task Horizons
Real-world AI task horizons measure how AI success rates decline as user-chosen tasks require more human time, capturing effective capability in deployed usage rather than in controlled benchmarks alone.
Key points
- Anthropic relates its task-success primitive to task-horizon work such as METR’s measurements of the length of tasks an AI can reliably complete [src-069, src-070].
- In first-party API data, success rates fall from around 60% for tasks estimated at under one human hour to roughly 45% for tasks estimated at 5+ human hours [src-069].
- The fitted curve for API traffic crosses 50% success at about 3.5 human hours, while the Claude.ai curve extrapolates to about 19 hours because multi-turn conversations let users decompose tasks and correct intermediate work [src-069, src-070] (see the fitting sketch after this list).
- Real-world task horizons mix model capability with users’ task selection, setup costs, and judgments about what is worth bringing to Claude [src-069, src-070].
- Controlled benchmarks measure autonomous frontier capability; real-world usage measures effective task horizon across broader, user-selected work [src-069, src-070].
- Anthropic’s scientific-computing case is a concrete long-horizon example: Claude Code worked over several days on a specialized numerical solver, using persistent memory, test oracles, Git coordination, and occasional steering [src-072] (see the oracle-loop sketch after this list).
- The case distinguishes between long-horizon work that can be pursued autonomously because progress is measurable and open-ended scientific discovery, where human judgment remains central [src-072].
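For intuition, the sketch below fits a logistic curve of success rate against log task length and solves for the 50% crossing, in the spirit of METR-style horizon fits. The data points and the `logistic` parameterization are illustrative assumptions consistent with the figures quoted above, not Anthropic’s actual dataset or fitting code.

```python
# Hypothetical horizon fit: success rate as a logistic function of log(human hours).
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_hours, horizon_log, slope):
    # Crosses 0.5 exactly where log_hours == horizon_log, i.e. at the 50% horizon.
    return 1.0 / (1.0 + np.exp(slope * (log_hours - horizon_log)))

# (task length in human hours, observed success rate) -- illustrative stand-ins.
hours = np.array([0.25, 0.5, 1.0, 2.0, 5.0, 10.0])
success = np.array([0.62, 0.60, 0.57, 0.53, 0.45, 0.40])

params, _ = curve_fit(logistic, np.log(hours), success, p0=[np.log(3.5), 1.0])
horizon_hours = np.exp(params[0])  # task length at which fitted success hits 50%
print(f"Estimated 50% horizon: {horizon_hours:.1f} human hours")
```

On the same fitted form, the gap between the API and Claude.ai horizons would show up as different `horizon_log` values when the two traffic sources are fit separately.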
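And a minimal sketch of the test-oracle pattern the case study describes: progress on a long-running solver is checkpointed only when a candidate matches a trusted reference within tolerance. All names here (`candidate_solve`, `reference_solve`, the tolerance) are hypothetical illustrations, not the actual Claude Code setup.

```python
# Hypothetical test-oracle loop: commit progress only when the fast solver
# under development agrees with a trusted (but slow) reference implementation.
import subprocess
import numpy as np

def reference_solve(x: np.ndarray) -> np.ndarray:
    # Stand-in oracle: a trusted closed form (here sin, for illustration).
    return np.sin(x)

def candidate_solve(x: np.ndarray) -> np.ndarray:
    # Stand-in for the agent's solver under development: a truncated series.
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

def oracle_accepts(candidate: np.ndarray, reference: np.ndarray,
                   rtol: float = 1e-4) -> bool:
    # The oracle check: agreement within a relative tolerance.
    return bool(np.allclose(candidate, reference, rtol=rtol))

def checkpoint(message: str) -> None:
    # Persist accepted progress with Git so a days-long session can resume.
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

# One iteration of the loop: generate, verify against the oracle, commit on pass.
test_input = np.linspace(0.0, 1.0, 100)
if oracle_accepts(candidate_solve(test_input), reference_solve(test_input)):
    checkpoint("solver passes oracle check on [0, 1] grid")
```

The key property is that the oracle makes progress measurable without human review of each step, which is what separates this kind of autonomy from open-ended discovery.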
Related concepts
- Economic Primitives
- Effective AI Job Coverage
- Human-Agent Collaboration
- Practitioner Model Benchmarking Methodology
- Statistical Model Evaluations
- Long-Running Scientific Agents
- Test Oracle Driven Agents