Paired-Difference Model Evals

Paired-difference model evals compare two models question by question on the same benchmark items; because both models answer an identical question list, per-question differencing cancels out question-difficulty noise in the measured score gap.

Key points

  • Anthropic argues that eval scores are meaningful mainly in relation to other scores: one model beats, ties, or trails another [src-067].
  • A two-sample test ignores the fact that model scores are usually collected on the same questions [src-067].
  • Paired-difference analysis removes variance from question difficulty and focuses on variance in model responses [src-067].
  • Anthropic reports that frontier-model question-score correlations on popular evals are often substantial, roughly 0.3 to 0.7, because models tend to get many of the same questions right or wrong [src-067].
  • The practical recommendation is to report pairwise mean differences, standard errors, confidence intervals, and correlations whenever comparing two or more models [src-067].
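The quantities in the recommendation above can be sketched with a small helper. This is an illustrative sketch, not code from the Anthropic post: the function name, the 0/1 scoring convention, and the use of a 95% normal-approximation interval (z = 1.96) are all assumptions.

```python
import math

def paired_difference_stats(scores_a, scores_b):
    """Compare two models graded on the same questions (1 = correct, 0 = wrong).

    Returns the mean score difference, its standard error under the paired
    design, a 95% normal-approximation confidence interval, and the Pearson
    correlation between the two models' per-question scores.
    """
    assert len(scores_a) == len(scores_b), "paired design needs the same questions"
    n = len(scores_a)
    # Per-question differences: question difficulty cancels within each pair.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Sample variance of the differences, so the SE reflects only
    # disagreement between the models, not shared difficulty.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var_diff / n)
    ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)
    # Pearson correlation between the two score vectors.
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(scores_a, scores_b)) / (n - 1)
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in scores_a) / (n - 1))
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in scores_b) / (n - 1))
    corr = cov / (sd_a * sd_b) if sd_a > 0 and sd_b > 0 else float("nan")
    return mean_diff, se, ci, corr
```

Note that a two-sample test would instead compute the variance of each model's scores separately and add them; when the score vectors are positively correlated, as the bullet above reports, that overstates the variance of the difference and widens the interval unnecessarily.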

Source references

  • [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)