Paired-Difference Model Evals
Paired-difference model evals compare models question by question on the same benchmark items; because both models answer an identical question set, variance due to question difficulty cancels out of the measured score difference, reducing noise.
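The variance-reduction claim follows from a standard identity (textbook statistics, not a formula quoted from the source). If models A and B have per-question score variances $\sigma_A^2$ and $\sigma_B^2$ and question-level correlation $\rho$ on the same $n$ independent questions, then

$$
\operatorname{Var}(\bar{a} - \bar{b}) = \frac{\sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B}{n},
$$

so any positive $\rho$ shrinks the variance of the measured difference relative to the unpaired value $(\sigma_A^2 + \sigma_B^2)/n$.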
Key points
- Anthropic argues that eval scores are meaningful mainly in relation to other scores: one model beats, ties, or trails another [src-067].
- A two-sample (unpaired) test ignores the fact that model scores are usually collected on the same questions, so it overstates the standard error of the difference whenever the scores are positively correlated [src-067].
- Paired-difference analysis removes variance from question difficulty and focuses on variance in model responses [src-067].
- Anthropic reports that frontier-model question-score correlations on popular evals are often substantial, roughly 0.3 to 0.7, because models tend to get many of the same questions right or wrong [src-067].
- The practical recommendation is to report pairwise mean differences, standard errors, confidence intervals, and question-level correlations whenever comparing two or more models; a minimal sketch of this computation follows this list [src-067].
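A minimal sketch of the paired analysis, assuming 0/1 per-question scores and using simulated data in place of a real eval harness (all names and numbers here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 0/1 per-question scores for two models graded on the same n
# questions. The shared difficulty term induces the positive question-level
# correlation the source describes; real scores would come from an eval run.
n = 500
difficulty = rng.random(n)
scores_a = (rng.random(n) > 0.9 * difficulty).astype(float)
scores_b = (rng.random(n) > difficulty).astype(float)

# Paired analysis: per-question differences, mean, standard error, 95% CI.
diffs = scores_a - scores_b
mean_diff = diffs.mean()
se_paired = diffs.std(ddof=1) / np.sqrt(n)
ci = (mean_diff - 1.96 * se_paired, mean_diff + 1.96 * se_paired)

# Two-sample (unpaired) standard error for contrast: it ignores pairing and
# is larger whenever the two score vectors are positively correlated.
se_unpaired = np.sqrt(scores_a.var(ddof=1) / n + scores_b.var(ddof=1) / n)

# Question-level Pearson correlation between the two models' scores.
r = np.corrcoef(scores_a, scores_b)[0, 1]

print(f"mean difference: {mean_diff:+.3f}")
print(f"paired SE:       {se_paired:.4f}, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}]")
print(f"unpaired SE:     {se_unpaired:.4f}")
print(f"correlation r:   {r:.2f}")
```

With positively correlated scores, the paired standard error should come out below the unpaired one, matching the variance identity above.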
Related entities
Related concepts
- Statistical Model Evaluations
- Question-Universe Eval Framing
- Practitioner Model Benchmarking Methodology
- Experiment Variance Reduction
- Statistical Significance Testing
Source references
- [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)