Listened to a podcast on goodfire.ai. A few good resources on mechanistic interpretability:
Key Research Papers
- Tracing the thoughts of a large language model (Anthropic)
- On the Biology of a Large Language Model
- Zoom In: An Introduction to Circuits
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Stage-Wise Model Diffing
- Mapping the Latent Space of Llama 3.3 70B
- Attribution-based parameter decomposition (LessWrong)
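Several of the papers above (Toy Models of Superposition, Towards Monosemanticity) center on sparse autoencoders (SAEs) for decomposing activations into interpretable features. As a rough mental model, here is a minimal sketch of an SAE forward pass and loss; all dimensions, names, and the numpy setup are illustrative, not taken from any paper's actual code:

```python
import numpy as np

# Minimal sparse-autoencoder sketch: encode model activations into an
# overcomplete, sparse feature dictionary, then reconstruct them.
# Dimensions and coefficients below are toy values for illustration.

rng = np.random.default_rng(0)

d_model = 16      # width of the activations being decomposed
d_dict = 64       # overcomplete dictionary of candidate features
l1_coeff = 1e-3   # weight on the sparsity penalty

W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU keeps features non-negative
    x_hat = W_dec @ f + b_dec
    return f, x_hat

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(f))
    return recon + sparsity

# Stand-in for a single residual-stream activation vector.
x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)
```

The idea, per the dictionary-learning framing, is that the L1 term forces each activation to be explained by a few dictionary directions, which tend to be more monosemantic than raw neurons.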
Blogs and Guides
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers (AI Alignment Forum)
- Mechanistic interpretability (Wikipedia)
- Emergent Misalignment: Narrow Finetuning can produce Broadly Misaligned LLMs
- Under the Hood of a Reasoning Model (Goodfire AI)
- Language models can explain neurons in language models (OpenAI)
- Interpreting Evo 2 (Goodfire AI)
- The Urgency of Interpretability (Dario Amodei)
Tools and Applications
Additional Resources
- Sparse Autoencoder (Stanford)
- Apollo Research