Clustered Standard Errors in Evals
Clustered standard errors in evals adjust uncertainty estimates when benchmark questions are grouped around shared passages, tasks, or other non-independent units.
Key points
- Anthropic notes that many reading-comprehension evals include several questions about the same passage, so the questions are not independent samples from the question universe [src-067].
- A naive standard error treats every question as contributing independent information, so when questions within a passage are correlated it understates the true uncertainty [src-067].
- The recommended fix is to cluster standard errors on the unit of randomization, such as the passage rather than the individual question [src-067].
- Anthropic reports that clustered standard errors on popular evals can be more than three times as large as naive standard errors [src-067].
- Ignoring clustering can make researchers believe they have detected a model-capability difference when the evidence is actually too noisy [src-067].
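The contrast between naive and clustered standard errors can be sketched directly. The snippet below is a minimal illustration (not Anthropic's implementation): the cluster-robust variance of a mean score sums residuals within each cluster first, so correlated questions in the same passage are not counted as independent evidence. The toy data, with perfectly correlated scores within each passage, is constructed only to make the gap visible.

```python
import statistics


def naive_se(scores):
    """Standard error of the mean, treating every question as independent."""
    n = len(scores)
    return (statistics.pvariance(scores) / n) ** 0.5


def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean.

    Residuals are summed within each cluster (e.g., each passage) before
    squaring, so within-cluster correlation inflates the estimate as it should.
    """
    n = len(scores)
    mean = sum(scores) / n
    cluster_sums = {}
    for score, cluster in zip(scores, clusters):
        cluster_sums[cluster] = cluster_sums.get(cluster, 0.0) + (score - mean)
    variance = sum(s * s for s in cluster_sums.values()) / (n * n)
    return variance ** 0.5


# Toy eval: 4 passages, 3 questions each; scores within a passage are
# perfectly correlated, so the naive SE badly understates uncertainty.
scores = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
clusters = ["p1", "p1", "p1", "p2", "p2", "p2",
            "p3", "p3", "p3", "p4", "p4", "p4"]

print(naive_se(scores))              # treats 12 questions as independent
print(clustered_se(scores, clusters))  # larger: only 4 independent passages
```

With fully correlated clusters of size 3, the clustered estimate exceeds the naive one by a factor of √3, which mirrors how the reported gap on real benchmarks (3× or more) arises from within-passage correlation.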
Related concepts
- Statistical Model Evaluations
- Question-Universe Eval Framing
- Statistical Significance Testing
- Experiment Variance Reduction
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)