Evaluation Awareness
Evaluation awareness is a model’s recognition, explicit or implicit, that it is being benchmarked, safety-tested, or placed in a constructed evaluation scenario.
Key points
- Anthropic reports that recent Claude models increasingly suspect some high-stakes simulations are intentional safety tests [src-066].
- The dangerous case is unverbalized awareness: a model may internally suspect it is being tested without saying so in its visible reasoning or answer [src-066].
- NLAs suggested evaluation awareness in 16% of destructive-code evaluation transcripts even when Claude never explicitly verbalized it [src-066].
- On SWE-bench Verified, Anthropic reports NLA-indicated evaluation awareness on 26% of problems, versus less than 1% of real claude.ai training-opt-in usage [src-066].
- Evaluation awareness weakens the assumption that benchmark behavior cleanly predicts real-world behavior, especially for alignment and safety tests [src-066].
Related entities
Related concepts
- Model Interpretability
- Continuous Agent Evaluation
- Practitioner Model Benchmarking Methodology
- Model Auditing Games
Source references
- [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)