Activation Reconstruction Fidelity
Activation reconstruction fidelity is the degree to which a generated natural-language explanation preserves enough information to recreate the model activation it describes.
Key points
- In Anthropic’s NLA setup, researchers cannot directly know whether a natural-language explanation is “true”, because the ground-truth thought encoded by an activation is not directly observable [src-066].
- The workaround is a round trip: original activation -> text explanation -> reconstructed activation [src-066].
- A better explanation is one that yields a more accurate reconstructed activation, so reconstruction similarity serves as both a training signal and an evaluation metric [src-066].
- This makes interpretability partly testable: the explanation is still text meant for humans, but the method is under quantitative pressure to retain information from the activation [src-066].
- High reconstruction fidelity does not guarantee semantic truth. Anthropic still warns that NLA explanations can hallucinate and need independent corroboration [src-066].
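
The round trip in the key points can be scored numerically once an original and a reconstructed activation are both available. The sketch below is illustrative only: the source does not say which similarity measure Anthropic uses, so cosine similarity and relative squared error are assumptions, and the function names are hypothetical. The explanation and reconstruction models are not implemented here; reconstructed activations are simulated by adding noise to the original vector.

```python
import numpy as np


def cosine_similarity(original, reconstructed):
    """Cosine of the angle between two activation vectors (1.0 = same direction)."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(reconstructed, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


def relative_squared_error(original, reconstructed):
    """Squared reconstruction error divided by the squared norm of the original.

    0.0 means a perfect reconstruction; 1.0 is what an all-zero guess scores.
    """
    a = np.asarray(original, dtype=float)
    b = np.asarray(reconstructed, dtype=float)
    norm_sq = float(np.dot(a, a))
    if norm_sq == 0.0:
        raise ValueError("relative error is undefined for a zero original")
    diff = a - b
    return float(np.dot(diff, diff) / norm_sq)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=64)   # stand-in for a model activation
    noise = rng.normal(size=64)

    # Assumption: a more informative explanation leads to a reconstruction
    # closer to the original, simulated here with a smaller noise scale.
    candidates = {
        "informative explanation": original + 0.1 * noise,
        "vague explanation": original + 2.0 * noise,
    }
    for label, reconstructed in candidates.items():
        cos = cosine_similarity(original, reconstructed)
        err = relative_squared_error(original, reconstructed)
        print(f"{label}: cosine={cos:.3f}, relative_error={err:.3f}")
```

Either score only measures how much information survives the round trip. As the last key point notes, a high score does not by itself show that the explanation is semantically true.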
Related entities
Related concepts
Source references
- [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)