Clustered Standard Errors in Evals

Clustered standard errors in evals adjust uncertainty estimates when benchmark questions are grouped into non-independent units, such as several questions that share a passage or task.

Key points

  • Anthropic notes that many reading-comprehension evals include several questions about the same passage, so the questions are not independent samples from the question universe [src-067].
  • A naive standard error treats every question as adding independent information and can therefore understate uncertainty [src-067].
  • The recommended fix is to cluster standard errors on the unit of randomization, such as the passage rather than the individual question [src-067].
  • Anthropic reports that clustered standard errors on popular evals can be more than three times as large as naive standard errors [src-067].
  • Ignoring clustering can make researchers believe they have detected a model-capability difference when the evidence is actually too noisy [src-067].
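The contrast between naive and clustered standard errors can be sketched numerically. The snippet below is a minimal illustration, not Anthropic's implementation: it simulates hypothetical eval data in which questions within the same passage share a passage-level difficulty, then compares the naive standard error of the mean score against a cluster-robust one that sums residuals within each passage before squaring. All names and parameters (40 passages, 5 questions each, the effect sizes) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical eval: 40 passages, 5 questions per passage. Questions in
# the same passage share a passage-level difficulty, so their scores are
# correlated within clusters (the situation the article describes).
n_clusters, per_cluster = 40, 5
passage_effect = rng.normal(0.0, 1.0, n_clusters)
latent = passage_effect.repeat(per_cluster) + rng.normal(0.0, 1.0, n_clusters * per_cluster)
scores = (latent > 0).astype(float)          # 0/1 correctness per question
clusters = np.arange(n_clusters).repeat(per_cluster)

n = scores.size
resid = scores - scores.mean()

# Naive SE: treats every question as an independent draw.
naive_se = scores.std(ddof=1) / np.sqrt(n)

# Cluster-robust SE for the mean: aggregate residuals within each
# passage first, so shared passage-level noise is not counted as
# independent information. (Small-sample corrections are omitted.)
cluster_sums = np.array([resid[clusters == c].sum() for c in range(n_clusters)])
clustered_se = np.sqrt((cluster_sums ** 2).sum()) / n

print(f"naive SE:     {naive_se:.4f}")
print(f"clustered SE: {clustered_se:.4f}")
```

With within-passage correlation present, the clustered estimate comes out noticeably larger than the naive one, consistent with the inflation Anthropic reports on popular evals.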

Source references

  • [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)