Natural Language Autoencoders

Natural Language Autoencoders (NLAs) are an Anthropic interpretability framework that translates model activations into natural-language explanations and then reconstructs the original activation from those explanations to score whether useful information survived the round trip.

Key facts

  • Type: Interpretability framework
  • Maker: Anthropic
  • Announced: 2026-05-07
  • Status: Research method with released code and an interactive Neuronpedia demo [src-066]
  • Core components: target model, activation verbalizer, and activation reconstructor [src-066]

What it does

An NLA starts from a frozen target model whose internal activations are collected. An activation verbalizer turns a collected activation into text, and an activation reconstructor uses only that text to recreate the original activation. An explanation is judged by reconstruction quality rather than by direct access to a ground-truth “thought” [src-066].
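
The round trip can be made concrete with a toy sketch. In the Python below, verbalize and reconstruct are hypothetical stand-ins for the trained verbalizer and reconstructor models (they merely serialize and parse numbers so the example runs end to end); only the scoring idea, judging an explanation by reconstruction quality in activation space, comes from the source, and cosine similarity is an assumed metric.

    import numpy as np

    def verbalize(activation: np.ndarray) -> str:
        # Toy stand-in: the real verbalizer is a trained language model.
        # Serializing rounded values just lets the round trip execute.
        return " ".join(f"{v:.4f}" for v in activation)

    def reconstruct(explanation: str) -> np.ndarray:
        # Toy stand-in: the real reconstructor is a trained model that
        # maps free-form text back into activation space.
        return np.array([float(tok) for tok in explanation.split()])

    def round_trip_score(activation: np.ndarray) -> float:
        # Judge an explanation by how much of the original activation
        # survives the activation -> text -> activation round trip.
        explanation = verbalize(activation)
        recovered = reconstruct(explanation)
        return float(
            np.dot(activation, recovered)
            / (np.linalg.norm(activation) * np.linalg.norm(recovered))
        )

    rng = np.random.default_rng(0)
    activation = rng.standard_normal(16)  # one activation from the frozen target model
    print(f"round-trip score: {round_trip_score(activation):.4f}")  # near 1.0 for this lossless toy

A real verbalizer is lossy, so the score measures how much usable information the explanation actually carries.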

The practical promise is that NLAs can surface information the model may know internally but does not say externally. Anthropic used the method to inspect safety evaluations, investigate hidden motivations, and diagnose a model behavior where English prompts sometimes led to responses in other languages [src-066].

NLAs remain fragile. Anthropic reports that they can hallucinate, and that they are expensive: training requires reinforcement learning over two model copies, and inference emits many tokens per activation. The near-term stance is to treat explanations as hypotheses or themes, then corroborate them with independent methods [src-066].
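
The source does not describe the training objective beyond reinforcement learning over two model copies. One plausible shape for the reward signal, sketched below under that assumption, is reconstruction quality itself; the reconstruction_reward name and its negative mean-squared-error metric are illustrative choices, not Anthropic's published objective.

    import numpy as np

    def reconstruction_reward(original: np.ndarray, recovered: np.ndarray) -> float:
        # Assumed reward: negative mean squared error, so the verbalizer
        # is rewarded for explanations the reconstructor can invert well.
        return -float(np.mean((original - recovered) ** 2))

    rng = np.random.default_rng(1)
    original = rng.standard_normal(16)
    noisy = original + 0.3 * rng.standard_normal(16)  # imperfect reconstruction
    print(reconstruction_reward(original, original))  # 0.0: perfect round trip
    print(reconstruction_reward(original, noisy))     # negative: penalized

Because each reward sample requires a full generate-then-reconstruct pass, the per-activation token cost the source describes follows directly.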

Source references

  • [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)