Activation Reconstruction Fidelity

Activation reconstruction fidelity is the degree to which a generated explanation preserves the information needed to recreate the original model activation.

Key points

  • In Anthropic’s natural language autoencoder (NLA) setup, researchers cannot check directly whether a natural-language explanation is “true”, because the thought an activation encodes has no observable ground truth [src-066].
  • The workaround is a round trip: original activation -> text explanation -> reconstructed activation [src-066].
  • A better explanation is one that enables a more accurate reconstructed activation, so reconstruction similarity becomes a training signal and evaluation metric [src-066].
  • This makes interpretability partly testable: the explanation remains human-readable text, but training puts quantitative pressure on it to retain information from the activation [src-066].
  • High reconstruction fidelity does not guarantee semantic truth. Anthropic still warns that NLA explanations can hallucinate and need independent corroboration [src-066].
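
The round trip in the points above can be illustrated with a minimal sketch. It is not Anthropic's implementation: the source does not say which similarity measure is used, so cosine similarity is an assumption here, and toy_explain, toy_reconstruct, and round_trip_fidelity are hypothetical stand-ins for the real explainer and reconstructor models.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def round_trip_fidelity(
    activation: Vector,
    explain: Callable[[Vector], str],
    reconstruct: Callable[[str], List[float]],
) -> float:
    """Score one round trip: activation -> text -> reconstructed activation."""
    text = explain(activation)
    rebuilt = reconstruct(text)
    # Assumption: similarity is measured with cosine similarity.
    return cosine_similarity(activation, rebuilt)


# Hypothetical toy stand-ins; the real components would be learned models.
def toy_explain(activation: Vector) -> str:
    # Deliberately lossy: keeps only the sign of each coordinate.
    return " ".join("pos" if x >= 0 else "neg" for x in activation)


def toy_reconstruct(text: str) -> List[float]:
    return [1.0 if token == "pos" else -1.0 for token in text.split()]


original = [0.5, -2.0, 1.5]
score = round_trip_fidelity(original, toy_explain, toy_reconstruct)
print(f"round-trip fidelity: {score:.3f}")
```

The toy explainer discards magnitudes, so the score falls below 1.0. The score measures only how much information survives the round trip: the "pos"/"neg" text scores well without being a meaningful description of anything, which mirrors the caveat that high fidelity does not guarantee semantic truth.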

Related entities

Related concepts

Source references

  • [src-066] Anthropic – “Natural Language Autoencoders: Turning Claude’s thoughts into text” (2026-05-07)