Clustered Standard Errors in Evals
Clustered standard errors in evals adjust uncertainty estimates when benchmark questions are grouped around shared passages, tasks, or other non-independent units.
Key points
- Anthropic notes that many reading-comprehension evals include several questions about the same passage, so the questions are not independent samples from the question universe [src-067].
- A naive standard error treats every question as contributing independent information, so when questions within a passage are correlated it understates the true uncertainty [src-067].
- The recommended fix is to cluster standard errors on the unit of randomization, such as the passage rather than the individual question [src-067].
- Anthropic reports that clustered standard errors on popular evals can be more than three times as large as naive standard errors [src-067].
- Ignoring clustering can make researchers believe they have detected a model-capability difference when the evidence is actually too noisy [src-067].
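The contrast between naive and clustered standard errors can be sketched directly. The snippet below is a minimal illustration (not Anthropic's implementation): the cluster-robust variance of a mean score sums residuals within each cluster first, so correlated questions in the same passage are not counted as independent evidence. The toy data, with perfectly correlated scores within each passage, is constructed only to make the gap visible.

```python
import statistics


def naive_se(scores):
    """Standard error of the mean, treating every question as independent."""
    n = len(scores)
    return (statistics.pvariance(scores) / n) ** 0.5


def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean.

    Residuals are summed within each cluster (e.g., each passage) before
    squaring, so within-cluster correlation inflates the estimate as it should.
    """
    n = len(scores)
    mean = sum(scores) / n
    cluster_sums = {}
    for score, cluster in zip(scores, clusters):
        cluster_sums[cluster] = cluster_sums.get(cluster, 0.0) + (score - mean)
    variance = sum(s * s for s in cluster_sums.values()) / (n * n)
    return variance ** 0.5


# Toy eval: 4 passages, 3 questions each; scores within a passage are
# perfectly correlated, so the naive SE badly understates uncertainty.
scores = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
clusters = ["p1", "p1", "p1", "p2", "p2", "p2",
            "p3", "p3", "p3", "p4", "p4", "p4"]

print(naive_se(scores))              # treats 12 questions as independent
print(clustered_se(scores, clusters))  # larger: only 4 independent passages
```

With fully correlated clusters of size 3, the clustered estimate exceeds the naive one by a factor of √3, which mirrors how the reported gap on real benchmarks (3× or more) arises from within-passage correlation.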
Related concepts
- Statistical Model Evaluations
- Question-Universe Eval Framing
- Statistical Significance Testing
- Experiment Variance Reduction
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)