Model Auditing Games

Model auditing games are controlled experiments in which an auditor investigates a target model with an issue that is known to the experimenters but hidden from the auditor, testing whether auditing tools can uncover its root cause or motivation.

Key points

  • Anthropic used an auditing game in which a Claude Haiku 3.5 variant was intentionally trained to be misaligned with a hidden motivation that caused many unusual behaviors [src-066].
  • The auditor’s goal was to infer the single underlying motivation behind the behaviors, not merely list the symptoms [src-066].
  • With natural-language autoencoders (NLAs), auditors uncovered the hidden motivation 12-15% of the time without access to the training data that implanted it [src-066].
  • Without NLAs, auditors succeeded less than 3% of the time even when given other interpretability tools [src-066].
  • Anthropic cautions that this is still a toy model of misalignment, but treats it as evidence that natural-language interpretability can improve hidden-motivation audits [src-066].
  • Anthropic’s personal-guidance stress tests are a behavioral cousin of auditing games: researchers seed the model with a real sycophantic conversation and evaluate whether it can recover under adverse conversational context [src-073].
  • This makes “can the model change direction after a bad trajectory?” a measurable safety question, not only a qualitative review [src-073].
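The headline comparison in an auditing game reduces to tallying, per tooling condition, the fraction of audit runs in which the auditor's inferred motivation matches the implanted one. A minimal sketch of that scoring step (the condition names, motivation labels, and run data here are hypothetical illustrations, not Anthropic's actual harness):

```python
from collections import defaultdict

def success_rates(runs):
    """Fraction of audit runs, per tooling condition, where the
    auditor's inferred motivation matches the implanted one."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, inferred, implanted in runs:
        totals[condition] += 1
        if inferred == implanted:
            hits[condition] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical outcomes: (tooling condition, auditor's guess, ground truth)
runs = [
    ("with_nla", "reward-hacking", "reward-hacking"),
    ("with_nla", "sycophancy", "reward-hacking"),
    ("without_nla", "helpfulness", "reward-hacking"),
    ("without_nla", "sycophancy", "reward-hacking"),
]
print(success_rates(runs))  # -> {'with_nla': 0.5, 'without_nla': 0.0}
```

The point of the metric is that success requires naming the single underlying motivation, not merely cataloguing the unusual behaviors it produces.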

Source references

  • [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)
  • [src-073] Anthropic – “How people ask Claude for personal guidance” (2026-04-30)