BENEDICT NEO 梁耀恩

goodfire

July 14, 2025

#journal

listened to a podcast on goodfire.ai. here are a few good resources on mechanistic interpretability:

Key Research Papers

  • Tracing the thoughts of a large language model (Anthropic)
  • On the Biology of a Large Language Model
  • Zoom In: An Introduction to Circuits
  • Toy Models of Superposition
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  • Stage-Wise Model Diffing
  • Mapping the Latent Space of Llama 3.3 70B
  • Attribution-based parameter decomposition (LessWrong)

Blogs and Guides

  • An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers (AI Alignment Forum)
  • Mechanistic interpretability (Wikipedia)
  • Emergent Misalignment: Narrow Finetuning can produce Broadly Misaligned LLMs
  • Under the Hood of a Reasoning Model (Goodfire AI)
  • Language models can explain neurons in language models (OpenAI)
  • Interpreting Evo 2 (Goodfire AI)
  • The Urgency of Interpretability (Dario Amodei)

Tools and Applications

  • Paint with Ember
  • Painting With Concepts Using Diffusion Model Latents

Additional Resources

  • Sparse Autoencoder (Stanford)
  • Apollo Research
