Paired-Difference Model Evals

Paired-difference model evals compare two models question by question on the same benchmark items; because both models answer an identical question list, per-question differencing cancels out question-difficulty noise in the measured score gap.

Key points

  • Anthropic argues that eval scores are meaningful mainly in relation to other scores: one model beats, ties, or trails another [src-067].
  • A two-sample test ignores the fact that model scores are usually collected on the same questions [src-067].
  • Paired-difference analysis removes variance from question difficulty and focuses on variance in model responses [src-067].
  • Anthropic reports that frontier-model question-score correlations on popular evals are often substantial, roughly 0.3 to 0.7, because models tend to get many of the same questions right or wrong [src-067].
  • The practical recommendation is to report pairwise mean differences, standard errors, confidence intervals, and correlations whenever comparing two or more models [src-067].
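The quantities in the recommendation above can be sketched with a small helper. This is an illustrative sketch, not code from the Anthropic post: the function name, the 0/1 scoring convention, and the use of a 95% normal-approximation interval (z = 1.96) are all assumptions.

```python
import math

def paired_difference_stats(scores_a, scores_b):
    """Compare two models graded on the same questions (1 = correct, 0 = wrong).

    Returns the mean score difference, its standard error under the paired
    design, a 95% normal-approximation confidence interval, and the Pearson
    correlation between the two models' per-question scores.
    """
    assert len(scores_a) == len(scores_b), "paired design needs the same questions"
    n = len(scores_a)
    # Per-question differences: question difficulty cancels within each pair.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Sample variance of the differences, so the SE reflects only
    # disagreement between the models, not shared difficulty.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var_diff / n)
    ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)
    # Pearson correlation between the two score vectors.
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(scores_a, scores_b)) / (n - 1)
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in scores_a) / (n - 1))
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in scores_b) / (n - 1))
    corr = cov / (sd_a * sd_b) if sd_a > 0 and sd_b > 0 else float("nan")
    return mean_diff, se, ci, corr
```

Note that a two-sample test would instead compute the variance of each model's scores separately and add them; when the score vectors are positively correlated, as the bullet above reports, that overstates the variance of the difference and widens the interval unnecessarily.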

Source references

  • [src-067] Anthropic – “A statistical approach to model evaluations” (2024-11-19)