Listened to a podcast on goodfire.ai. A few good resources on mechanistic interpretability:
Key Research Papers
- Tracing the thoughts of a large language model (Anthropic)
- On the Biology of a Large Language Model
- Zoom In: An Introduction to Circuits
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Stage-Wise Model Diffing
- Mapping the Latent Space of Llama 3.3 70B
- Attribution-based parameter decomposition (LessWrong)
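Several of the papers above (Toy Models of Superposition, Towards Monosemanticity) center on sparse autoencoders (SAEs) for decomposing activations into interpretable features. As a rough mental model, here is a minimal sketch of an SAE forward pass and loss; all dimensions, names, and the numpy setup are illustrative, not taken from any paper's actual code:

```python
import numpy as np

# Minimal sparse-autoencoder sketch: encode model activations into an
# overcomplete, sparse feature dictionary, then reconstruct them.
# Dimensions and coefficients below are toy values for illustration.

rng = np.random.default_rng(0)

d_model = 16      # width of the activations being decomposed
d_dict = 64       # overcomplete dictionary of candidate features
l1_coeff = 1e-3   # weight on the sparsity penalty

W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU keeps features non-negative
    x_hat = W_dec @ f + b_dec
    return f, x_hat

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(f))
    return recon + sparsity

# Stand-in for a single residual-stream activation vector.
x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)
```

The idea, per the dictionary-learning framing, is that the L1 term forces each activation to be explained by a few dictionary directions, which tend to be more monosemantic than raw neurons.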
Blogs and Guides
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers (AI Alignment Forum)
- Mechanistic interpretability (Wikipedia)
- Emergent Misalignment: Narrow Finetuning can produce Broadly Misaligned LLMs
- Under the Hood of a Reasoning Model (Goodfire AI)
- Language models can explain neurons in language models (OpenAI)
- Interpreting Evo 2 (Goodfire AI)
- The Urgency of Interpretability (Dario Amodei)
Tools and Applications
Additional Resources
- Sparse Autoencoder (Stanford)
- Apollo Research