Paired-Difference Model Evals
Paired-difference model evals compare models question by question on the same benchmark items; because both models answer an identical question set, variance due to question difficulty cancels out of the measured score difference, reducing noise.
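The variance-reduction claim follows from a standard identity (textbook statistics, not a formula quoted from the source). If models A and B have per-question score variances $\sigma_A^2$ and $\sigma_B^2$ and question-level correlation $\rho$ on the same $n$ independent questions, then

$$
\operatorname{Var}(\bar{a} - \bar{b}) = \frac{\sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B}{n},
$$

so any positive $\rho$ shrinks the variance of the measured difference relative to the unpaired value $(\sigma_A^2 + \sigma_B^2)/n$.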
Key points
- Anthropic argues that eval scores are meaningful mainly in relation to other scores: one model beats, ties, or trails another [src-067].
- A two-sample (unpaired) test ignores the fact that model scores are usually collected on the same questions, so it overstates the standard error of the difference whenever the scores are positively correlated [src-067].
- Paired-difference analysis removes variance from question difficulty and focuses on variance in model responses [src-067].
- Anthropic reports that frontier-model question-score correlations on popular evals are often substantial, roughly 0.3 to 0.7, because models tend to get many of the same questions right or wrong [src-067].
- The practical recommendation is to report pairwise mean differences, standard errors, confidence intervals, and question-level correlations whenever comparing two or more models; a minimal sketch of this computation follows this list [src-067].
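A minimal sketch of the paired analysis, assuming 0/1 per-question scores and using simulated data in place of a real eval harness (all names and numbers here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 0/1 per-question scores for two models graded on the same n
# questions. The shared difficulty term induces the positive question-level
# correlation the source describes; real scores would come from an eval run.
n = 500
difficulty = rng.random(n)
scores_a = (rng.random(n) > 0.9 * difficulty).astype(float)
scores_b = (rng.random(n) > difficulty).astype(float)

# Paired analysis: per-question differences, mean, standard error, 95% CI.
diffs = scores_a - scores_b
mean_diff = diffs.mean()
se_paired = diffs.std(ddof=1) / np.sqrt(n)
ci = (mean_diff - 1.96 * se_paired, mean_diff + 1.96 * se_paired)

# Two-sample (unpaired) standard error for contrast: it ignores pairing and
# is larger whenever the two score vectors are positively correlated.
se_unpaired = np.sqrt(scores_a.var(ddof=1) / n + scores_b.var(ddof=1) / n)

# Question-level Pearson correlation between the two models' scores.
r = np.corrcoef(scores_a, scores_b)[0, 1]

print(f"mean difference: {mean_diff:+.3f}")
print(f"paired SE:       {se_paired:.4f}, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}]")
print(f"unpaired SE:     {se_unpaired:.4f}")
print(f"correlation r:   {r:.2f}")
```

With positively correlated scores, the paired standard error should come out below the unpaired one, matching the variance identity above.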
Related entities
Related concepts
- Statistical Model Evaluations
- Question-Universe Eval Framing
- Practitioner Model Benchmarking Methodology
- Experiment Variance Reduction
- Statistical Significance Testing
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)