Question-Universe Eval Framing
Question-universe eval framing treats a benchmark’s questions as a random sample drawn from a broader universe of possible questions with a similar difficulty distribution.
Key points
- Anthropic says the object of interest is not the observed average on one benchmark but the theoretical average across all possible questions of that type [src-067].
- This framing separates model skill from the luck of drawing easier or harder questions in a particular benchmark [src-067].
- By the Central Limit Theorem, the means of repeated benchmark samples drawn from the same universe would be approximately normally distributed around the theoretical mean [src-067].
- Reporting standard error and confidence intervals makes the implied question-universe uncertainty visible [src-067].
- The framing only works cleanly when the independent-sampling assumption is plausible; questions drawn in related clusters (e.g., several questions about one reading passage) violate independence and call for clustered standard errors instead [src-067].
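The key points above can be sketched numerically. This is a minimal illustration, not Anthropic’s implementation: the scores and cluster ids are made-up data, and the cluster-robust formula shown is the standard one for the standard error of a mean (sum of squared within-cluster residual totals).

```python
import math

# Illustrative per-question scores (1 = correct, 0 = incorrect); not real data.
scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
n = len(scores)
mean = sum(scores) / n  # observed benchmark average

# Standard error of the mean under the i.i.d. sampling assumption
# (sample variance with n - 1 degrees of freedom).
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
sem = math.sqrt(var / n)

# 95% confidence interval for the theoretical (question-universe) mean,
# using the CLT normal approximation (z ≈ 1.96).
ci = (mean - 1.96 * sem, mean + 1.96 * sem)

# When questions arrive in related clusters (e.g., several questions per
# passage), a cluster-robust standard error sums residuals within each
# cluster first: SE = sqrt(sum_c (sum_{i in c} (x_i - mean))^2) / n.
clusters = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
cluster_resid = {}
for s, c in zip(scores, clusters):
    cluster_resid[c] = cluster_resid.get(c, 0.0) + (s - mean)
clustered_sem = math.sqrt(sum(t ** 2 for t in cluster_resid.values())) / n
```

Reporting `mean` together with `sem` (or `ci`) makes the question-universe uncertainty visible; swapping in `clustered_sem` handles the clustered case.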
Related entities
Related concepts
- Statistical Model Evaluations
- Clustered Standard Errors in Evals
- Paired-Difference Model Evals
- Practitioner Model Benchmarking Methodology
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)