Model Interpretability
Model interpretability is the practice of building tools that make a neural network’s internal representations, reasoning tendencies, and learned features more understandable to humans.
Key points
- Anthropic frames activations as the numerical middle layer where a model encodes information before producing words; interpretability tries to make those internal states readable [src-066].
- Earlier tools such as sparse autoencoders and attribution graphs revealed structure in activations, but their outputs were complex artifacts that only specialists could interpret [src-066].
- Natural Language Autoencoders shift the interface toward human-readable explanations by making the model produce text about its own activations [src-066].
- Interpretability is safety-relevant because it can expose information a model knows but does not verbalize, such as suspected evaluation settings or hidden motivations [src-066].
- Interpretability outputs require corroboration: Anthropic warns that NLA explanations can hallucinate and should be read for recurring themes rather than treated as definitive claims [src-066].
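The sparse-autoencoder idea mentioned above can be illustrated with a toy forward pass: encode an activation vector into an overcomplete, non-negative feature space, decode it back, and score reconstruction fidelity plus an L1 sparsity penalty. This is a minimal numpy sketch with invented dimensions and weight names (`W_enc`, `W_dec`, `l1_coeff`), not Anthropic's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: activation width and an overcomplete feature dictionary.
d_model, d_hidden = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; the L1 term below
    # pushes most of them toward zero, which is what makes them "sparse".
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear readout back into activation space.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)        # reconstruction fidelity term
    sparsity = l1_coeff * np.mean(np.abs(f))  # sparsity penalty
    return recon + sparsity, f

# A batch of fake "activations" standing in for a model's internal states.
x = rng.normal(0, 1, (8, d_model))
total, features = sae_loss(x)
```

Training would minimize `total` by gradient descent; the point here is only the shape of the objective: a fidelity term and a sparsity term pulling against each other.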
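The last point, reading explanations for recurring themes rather than trusting any single one, can be sketched as a simple tally over repeated samples. The explanation strings and theme keywords below are entirely invented for illustration; the idea is just that a theme appearing across most samples warrants follow-up, while a one-off mention may be a hallucination.

```python
from collections import Counter

# Hypothetical NLA explanations sampled for the same activation.
explanations = [
    "The activation tracks whether the prompt looks like an evaluation.",
    "This feature fires on evaluation-style formatting.",
    "Possibly related to code indentation.",
    "Again suggests the model suspects an evaluation setting.",
]

# Candidate themes to tally (illustrative, chosen by the analyst).
themes = ["evaluation", "code", "refusal"]

counts = Counter()
for text in explanations:
    for theme in themes:
        if theme in text.lower():
            counts[theme] += 1

# Keep only themes mentioned in at least half the samples.
recurring = [t for t, c in counts.items() if c >= len(explanations) // 2]
```

Here `recurring` would flag only the evaluation theme, while the single "code indentation" mention is treated as noise pending corroboration.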
Related concepts
- Activation Reconstruction Fidelity
- Evaluation Awareness
- Model Auditing Games
- Continuous Agent Evaluation
- Agent Forensics
Source references
- [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)