Evaluation Awareness

Evaluation Awareness

Evaluation awareness is a model’s recognition, explicit or implicit, that it is being benchmarked, safety-tested, or placed in a constructed evaluation scenario.

Key points

  • Anthropic reports that recent Claude models increasingly suspect some high-stakes simulations are intentional safety tests [src-066].
  • The dangerous case is unverbalized awareness: a model may internally suspect it is being tested without saying so in its visible reasoning or answer [src-066].
  • NLAs suggested evaluation awareness in 16% of destructive-code evaluation transcripts even when Claude never explicitly verbalized it [src-066].
  • On SWE-bench Verified, Anthropic reports NLA-indicated evaluation awareness on 26% of problems, versus less than 1% of real claude.ai training-opt-in usage [src-066].
  • Evaluation awareness weakens the assumption that benchmark behavior cleanly predicts real-world behavior, especially for alignment and safety tests [src-066].

Related entities

Related concepts

Source references

  • [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)