In the broadest sense, mechanistic interpretability refers to explaining the behavior of neural networks in terms of their internal components. We cover early work on vision models, transformer circuits, and automated circuit discovery. We then turn to superposition (what it means mathematically and why we think it occurs in modern transformer language models), the linear representation hypothesis, and sparse autoencoders. Finally, we discuss recent applications in deployed AI systems and offer a balanced perspective on when mechanistic interpretability is the right tool and when other approaches may be more appropriate as future AI systems become more capable.
*Bio:* Arthur Conmy is a Senior Research Engineer at Google DeepMind. He has produced foundational mechanistic interpretability research, including Interpretability in the Wild (ICLR 2023) and ACDC: Automated Circuit Discovery (NeurIPS 2023), and recently added activation probes to live Gemini deployments to detect misuse.
