Practitioner Model Benchmarking Methodology
Nate Herk’s framework for evaluating AI models on real agentic tasks, beyond official benchmarks.
Methodology
1. Design 3–5 representative tasks given identically to every model (e.g., build a personal brand site, build a 3D game, simulate an ecosystem)
2. Run each task as a one-shot prompt across all models — no iteration, no follow-up
3. Capture from the JSONL run logs: wall-clock runtime, input tokens, output tokens, cost, and tool call count (see the parsing sketch after this list)
4. Assess output quality subjectively against the same rubric
5. Cross-reference with official benchmarks to triangulate the results [src-012]
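A minimal sketch of step 3, assuming each run writes a JSONL file with one record per run; the field names (model, duration_s, input_tokens, output_tokens, cost_usd, tool_calls) are placeholders, and the real log schema may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_runs(log_dir: str) -> dict:
    """Aggregate per-model metrics from one-shot run logs.

    Assumes each *.jsonl file holds one record per run with the
    (hypothetical) fields: model, duration_s, input_tokens,
    output_tokens, cost_usd, tool_calls.
    """
    totals = defaultdict(lambda: {
        "runs": 0, "duration_s": 0.0, "input_tokens": 0,
        "output_tokens": 0, "cost_usd": 0.0, "tool_calls": 0,
    })
    for path in Path(log_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            agg = totals[rec["model"]]
            agg["runs"] += 1
            agg["duration_s"] += rec["duration_s"]
            agg["input_tokens"] += rec["input_tokens"]
            agg["output_tokens"] += rec["output_tokens"]
            agg["cost_usd"] += rec["cost_usd"]
            agg["tool_calls"] += rec["tool_calls"]
    return dict(totals)


if __name__ == "__main__":
    for model, agg in summarize_runs("logs").items():
        print(model, agg)
```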
Key insight: output token efficiency matters most
Output tokens are priced higher than input tokens. A model that produces the same result with 3.5x fewer output tokens (as GPT 5.5 did against Opus 4.7 in these runs) is significantly cheaper in practice, even when the nominal per-token pricing looks similar. [src-012]
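An illustrative cost comparison of that effect; the per-million-token prices below are placeholders chosen for the example, not actual list prices for any model.

```python
# Assumed illustrative prices, identical for both models.
PRICE_IN_PER_M = 3.00    # $/1M input tokens
PRICE_OUT_PER_M = 15.00  # $/1M output tokens


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one run under the assumed per-token prices."""
    return (input_tokens / 1e6) * PRICE_IN_PER_M + (output_tokens / 1e6) * PRICE_OUT_PER_M


# Same task, same result: model A emits 3.5x fewer output tokens than model B.
cost_a = run_cost(input_tokens=200_000, output_tokens=40_000)
cost_b = run_cost(input_tokens=200_000, output_tokens=140_000)
print(f"model A: ${cost_a:.2f}  model B: ${cost_b:.2f}")  # output tokens dominate the gap
```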
Official benchmarks used for cross-reference
- Terminal Bench 2.0 (agentic coding speed/accuracy)
- SWE-bench Verified (real GitHub issue resolution)
- GDPval, Frontier Math, Cyber Gym [src-012]
Statistical reporting layer
- Anthropic’s eval-statistics paper adds a formal layer that practitioner benchmarks often omit: every benchmark score should be treated as a noisy estimate with standard errors and confidence intervals, not a single deterministic capability fact [src-067].
- When two models are run on the same questions, Paired-Difference Model Evals should report mean differences, standard errors, confidence intervals, and score correlations rather than only two independent headline scores (see the sketch after this list) [src-067].
- If benchmark items are grouped by passage, task, repository, or another shared unit, Clustered Standard Errors in Evals are needed to avoid overconfident conclusions [src-067].
- Power analysis can decide whether a benchmark is large enough to detect the size of model gap the evaluator actually cares about (a second sketch after this list shows the calculation) [src-067].
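A minimal sketch of the paired-difference reporting described above, including a cluster-robust standard error over a shared grouping such as task or repository. The function and field names are my own, not taken from [src-067]; the toy scores at the end are made up.

```python
import math
from collections import defaultdict
from statistics import mean, stdev, correlation  # correlation: Python 3.10+


def paired_difference_report(scores_a, scores_b, clusters, z=1.96):
    """Compare two models scored on the same items.

    scores_a, scores_b: per-item scores in the same item order.
    clusters: per-item cluster labels (e.g. task or repository).
    Returns the mean difference, naive and cluster-robust standard
    errors, a ~95% CI using the cluster-robust SE, and the score
    correlation between the two models.
    """
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = mean(diffs)

    # Naive SE of the mean difference (treats items as independent).
    se_naive = stdev(diffs) / math.sqrt(n)

    # Cluster-robust SE: sum centered differences within each cluster,
    # then combine the per-cluster sums.
    cluster_sums = defaultdict(float)
    for d, c in zip(diffs, clusters):
        cluster_sums[c] += d - d_bar
    se_clustered = math.sqrt(sum(s * s for s in cluster_sums.values())) / n

    return {
        "mean_diff": d_bar,
        "se_naive": se_naive,
        "se_clustered": se_clustered,
        "ci95": (d_bar - z * se_clustered, d_bar + z * se_clustered),
        "correlation": correlation(scores_a, scores_b),
    }


# Toy example: 6 items grouped into 3 tasks.
report = paired_difference_report(
    scores_a=[1, 1, 0, 1, 1, 0],
    scores_b=[1, 0, 0, 1, 0, 0],
    clusters=["site", "site", "game", "game", "eco", "eco"],
)
print(report)
```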
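And a sketch of the power-analysis step: given the standard deviation of the paired per-item differences and the smallest model gap worth detecting, estimate how many eval items are needed at roughly 5% significance and 80% power. The normal-approximation formula is standard; the numbers plugged into the example are invented.

```python
import math


def items_needed(min_gap: float, sd_diff: float,
                 z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Approximate number of eval items needed to detect a mean paired
    difference of `min_gap`, given the per-item standard deviation of
    differences, at ~5% significance and ~80% power (normal approximation).
    """
    n = ((z_alpha + z_power) * sd_diff / min_gap) ** 2
    return math.ceil(n)


# E.g. to detect a 3-point gap when per-item differences have SD ~20 points:
print(items_needed(min_gap=3.0, sd_diff=20.0))  # ~349 items
```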
Related entities
- GPT 5.5 — April 2026 benchmark subject vs Opus 4.7
- Claude Opus 4.7 — April 2026 benchmark subject
- GPT Image 2, Imagen 3 (Nano Banana 2) — image model benchmark subjects
- Anthropic — source for the statistical eval-reporting recommendations
Related concepts
- Claude Code Token Economics — cost model underpinning the benchmark analysis
- Statistical Model Evaluations
- Question-Universe Eval Framing
- Paired-Difference Model Evals
- Clustered Standard Errors in Evals
- Experiment Statistical Power