Question-Universe Eval Framing

Question-universe eval framing treats a benchmark’s questions as a sample drawn from a broader universe of possible questions with a similar difficulty distribution.

Key points

  • Anthropic says the object of interest is not the observed average on one benchmark but the theoretical average across all possible questions of that type [src-067].
  • This framing separates model skill from the luck of drawing easier or harder questions in a particular benchmark [src-067].
  • Under the Central Limit Theorem, the means of repeated benchmarks sampled independently from the same universe are approximately normally distributed around the theoretical mean, provided each benchmark contains enough questions [src-067].
  • Reporting the standard error of the mean and a confidence interval alongside the score shows how far the observed average could plausibly lie from the question-universe average [src-067].
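  • The framing only works cleanly when questions can plausibly be treated as independent draws from the universe; when questions are clustered (for example, several questions about the same passage), the uncertainty estimate must account for the clustering, typically with cluster-adjusted standard errors [src-067].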
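
The key points above can be illustrated with a minimal Python sketch. It is not taken from [src-067]; the function names (mean_and_sem, clustered_sem, ci95) and the toy data are hypothetical. It assumes per-question scores in [0, 1], a normal approximation for the 95% interval (the 1.96 multiplier), and a basic cluster-robust standard error without any small-sample correction.

```python
import math
from collections import defaultdict


def mean_and_sem(scores):
    """Observed mean and standard error, treating questions as independent draws."""
    n = len(scores)
    mean = sum(scores) / n
    # Unbiased sample variance of the per-question scores.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)


def clustered_sem(scores, cluster_ids):
    """Cluster-robust standard error (assumed form, no small-sample correction).

    Residuals are summed within each cluster before squaring, so correlated
    questions in the same cluster do not count as independent evidence.
    """
    n = len(scores)
    mean = sum(scores) / n
    resid_sum = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        resid_sum[c] += s - mean
    return math.sqrt(sum(r ** 2 for r in resid_sum.values())) / n


def ci95(mean, sem):
    """Approximate 95% confidence interval under the normal approximation."""
    return mean - 1.96 * sem, mean + 1.96 * sem


# Toy data (hypothetical): 8 questions grouped into 3 clusters.
scores = [1, 1, 1, 0, 0, 0, 1, 1]
clusters = ["a", "a", "a", "b", "b", "b", "c", "c"]

mean, naive = mean_and_sem(scores)
clustered = clustered_sem(scores, clusters)
print("mean:", mean)
print("naive SEM:", naive, "CI:", ci95(mean, naive))
print("clustered SEM:", clustered, "CI:", ci95(mean, clustered))
```

In this toy data every question within a cluster has the same score, so the clustered standard error comes out larger than the naive one, which is the situation the last key point warns about.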

Related entities

Related concepts

Source references

  • [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)