Question-Universe Eval Framing
Question-universe eval framing treats a benchmark’s questions as a random sample drawn from a broader universe of possible questions with a similar difficulty distribution.
Key points
- Anthropic says the object of interest is not the observed average on one benchmark but the theoretical average across all possible questions of that type [src-067].
- This framing separates model skill from the luck of drawing easier or harder questions in a particular benchmark [src-067].
- By the Central Limit Theorem, the means of repeated benchmark samples drawn from the same universe would be approximately normally distributed around the theoretical mean [src-067].
- Reporting standard error and confidence intervals makes the implied question-universe uncertainty visible [src-067].
- The framing only works cleanly when the independent-sampling assumption is plausible; questions drawn in related clusters (e.g., several questions about one reading passage) violate independence and call for clustered standard errors instead [src-067].
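The key points above can be sketched numerically. This is a minimal illustration, not Anthropic’s implementation: the scores and cluster ids are made-up data, and the cluster-robust formula shown is the standard one for the standard error of a mean (sum of squared within-cluster residual totals).

```python
import math

# Illustrative per-question scores (1 = correct, 0 = incorrect); not real data.
scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
n = len(scores)
mean = sum(scores) / n  # observed benchmark average

# Standard error of the mean under the i.i.d. sampling assumption
# (sample variance with n - 1 degrees of freedom).
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
sem = math.sqrt(var / n)

# 95% confidence interval for the theoretical (question-universe) mean,
# using the CLT normal approximation (z ≈ 1.96).
ci = (mean - 1.96 * sem, mean + 1.96 * sem)

# When questions arrive in related clusters (e.g., several questions per
# passage), a cluster-robust standard error sums residuals within each
# cluster first: SE = sqrt(sum_c (sum_{i in c} (x_i - mean))^2) / n.
clusters = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
cluster_resid = {}
for s, c in zip(scores, clusters):
    cluster_resid[c] = cluster_resid.get(c, 0.0) + (s - mean)
clustered_sem = math.sqrt(sum(t ** 2 for t in cluster_resid.values())) / n
```

Reporting `mean` together with `sem` (or `ci`) makes the question-universe uncertainty visible; swapping in `clustered_sem` handles the clustered case.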
Related entities
Related concepts
- Statistical Model Evaluations
- Clustered Standard Errors in Evals
- Paired-Difference Model Evals
- Practitioner Model Benchmarking Methodology
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)