Natural Language Autoencoders

Natural Language Autoencoders (NLAs) are an Anthropic interpretability framework that translates model activations into natural-language explanations and then reconstructs the original activation from those explanations to score whether useful information survived the round trip.

Key facts

  • Type: Interpretability framework
  • Maker: Anthropic
  • Announced: 2026-05-07
  • Status: Research method with released code and an interactive Neuronpedia demo [src-066]
  • Core components: target model, activation verbalizer, and activation reconstructor [src-066]

What it does

An NLA starts from a frozen target model whose internal activations are collected. An activation verbalizer turns a collected activation into text, and an activation reconstructor uses only that text to recreate the original activation. An explanation is judged by reconstruction quality rather than by direct access to a ground-truth “thought” [src-066].
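
The round trip can be made concrete with a toy sketch. In the Python below, verbalize and reconstruct are hypothetical stand-ins for the trained verbalizer and reconstructor models (they merely serialize and parse numbers so the example runs end to end); only the scoring idea, judging an explanation by reconstruction quality in activation space, comes from the source, and cosine similarity is an assumed metric.

    import numpy as np

    def verbalize(activation: np.ndarray) -> str:
        # Toy stand-in: the real verbalizer is a trained language model.
        # Serializing rounded values just lets the round trip execute.
        return " ".join(f"{v:.4f}" for v in activation)

    def reconstruct(explanation: str) -> np.ndarray:
        # Toy stand-in: the real reconstructor is a trained model that
        # maps free-form text back into activation space.
        return np.array([float(tok) for tok in explanation.split()])

    def round_trip_score(activation: np.ndarray) -> float:
        # Judge an explanation by how much of the original activation
        # survives the activation -> text -> activation round trip.
        explanation = verbalize(activation)
        recovered = reconstruct(explanation)
        return float(
            np.dot(activation, recovered)
            / (np.linalg.norm(activation) * np.linalg.norm(recovered))
        )

    rng = np.random.default_rng(0)
    activation = rng.standard_normal(16)  # one activation from the frozen target model
    print(f"round-trip score: {round_trip_score(activation):.4f}")  # near 1.0 for this lossless toy

A real verbalizer is lossy, so the score measures how much usable information the explanation actually carries.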

The practical promise is that NLAs can surface information the model may know internally but does not say externally. Anthropic used the method to inspect safety evaluations, investigate hidden motivations, and diagnose a model behavior where English prompts sometimes led to responses in other languages [src-066].

NLAs remain fragile. Anthropic reports that they can hallucinate, and that they are expensive: training requires reinforcement learning over two model copies, and inference emits many tokens per activation. The near-term stance is to treat explanations as hypotheses or themes, then corroborate them with independent methods [src-066].
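
The source does not describe the training objective beyond reinforcement learning over two model copies. One plausible shape for the reward signal, sketched below under that assumption, is reconstruction quality itself; the reconstruction_reward name and its negative mean-squared-error metric are illustrative choices, not Anthropic's published objective.

    import numpy as np

    def reconstruction_reward(original: np.ndarray, recovered: np.ndarray) -> float:
        # Assumed reward: negative mean squared error, so the verbalizer
        # is rewarded for explanations the reconstructor can invert well.
        return -float(np.mean((original - recovered) ** 2))

    rng = np.random.default_rng(1)
    original = rng.standard_normal(16)
    noisy = original + 0.3 * rng.standard_normal(16)  # imperfect reconstruction
    print(reconstruction_reward(original, original))  # 0.0: perfect round trip
    print(reconstruction_reward(original, noisy))     # negative: penalized

Because each reward sample requires a full generate-then-reconstruct pass, the per-activation token cost the source describes follows directly.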

Source references

  • [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)