What is Mechanistic Interpretability?
Mechanistic Interpretability (MI) is a subfield of AI alignment and machine learning research focused on understanding how neural networks work internally—not just what they do, but the specific algorithms and representations they learn.
Unlike traditional interpretability methods (such as saliency maps or feature-importance scores), MI aims to reverse-engineer the actual computations happening inside models, much like understanding a circuit diagram rather than just observing inputs and outputs.
Why Does It Matter?
As AI systems become more powerful and are deployed in critical applications, we need to:
- Verify alignment - Ensure models are pursuing intended objectives
- Detect deception - Identify if models are behaving differently during evaluation vs. deployment
- Predict failures - Understand edge cases before they occur
- Build trust - Provide genuine explanations for model behavior
Key Concepts
Superposition
Models represent more features than they have dimensions by encoding them in overlapping, nearly orthogonal directions. This makes interpretation challenging, but it is key to understanding how models use their limited capacity.
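To see why this is geometrically possible, here is a minimal sketch in plain PyTorch: it packs far more random unit-norm "feature" directions into a space than it has dimensions and measures how much they interfere. The sizes are arbitrary and chosen only for illustration.

```python
# A minimal sketch: many "features" as random unit vectors in a low-dimensional
# space. With far more features than dimensions, exact orthogonality is
# impossible, but the typical pairwise overlap (cosine similarity) stays small,
# which is the geometric fact superposition exploits.
import torch

n_features, d_model = 1000, 64
torch.manual_seed(0)

# Random feature directions, normalized to unit length.
features = torch.randn(n_features, d_model)
features = features / features.norm(dim=-1, keepdim=True)

# Pairwise cosine similarities (off-diagonal entries measure interference).
overlaps = features @ features.T
off_diag = overlaps[~torch.eye(n_features, dtype=torch.bool)]

print(f"{n_features} features in {d_model} dimensions")
print(f"mean |overlap| = {off_diag.abs().mean().item():.3f}")
print(f"max  |overlap| = {off_diag.abs().max().item():.3f}")
```

With 1000 features in 64 dimensions, typical overlaps come out on the order of 1/√64 ≈ 0.125 rather than zero: the directions are not perfectly orthogonal, but they interfere little enough that many more features can coexist than a one-feature-per-dimension picture would allow.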
Circuits
Specific computational subgraphs within neural networks that implement identifiable algorithms (e.g., induction heads in transformers that enable in-context learning).
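As a concrete example, the sketch below uses the TransformerLens library (assuming its HookedTransformer and ActivationCache APIs) to look for the characteristic induction-head attention pattern in GPT-2 small: on a random token sequence repeated twice, an induction head attends from each token in the second half back to the token that followed its first occurrence. The 0.4 cutoff is an arbitrary illustrative threshold.

```python
# A minimal sketch (assuming the TransformerLens API) that scores each head on
# how strongly it attends from a repeated token back to the token that followed
# its first occurrence -- the signature attention pattern of an induction head.
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

seq_len, batch = 50, 4
# Random tokens repeated once: [A B C ... | A B C ...]
first_half = torch.randint(100, 20000, (batch, seq_len))
tokens = torch.cat([first_half, first_half], dim=1)

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    # pattern: [batch, head, query_pos, key_pos]
    pattern = cache["pattern", layer]
    # An induction head at query position i attends to key position
    # i - seq_len + 1, i.e. the diagonal offset by -(seq_len - 1).
    induction_stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores = induction_stripe.mean(dim=(0, -1))  # average over batch and positions
    for head, score in enumerate(scores):
        if score > 0.4:  # arbitrary threshold, purely for illustration
            print(f"Layer {layer}, head {head}: induction score {score.item():.2f}")
```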
Features
The fundamental units of representation that models learn—often corresponding to human-interpretable concepts like "is a proper noun" or "refers to a location."
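A common working assumption (the linear representation hypothesis) is that a feature corresponds to a direction in activation space, so its strength on a given input can be read off with a dot product. The sketch below uses random vectors purely as placeholders; in practice the direction might come from a probe, a sparse autoencoder, or a difference of means.

```python
# A minimal sketch of the "feature = direction" picture: given a batch of
# activation vectors and a candidate feature direction, the feature's strength
# on each example is just a projection (dot product).
# Both the activations and the direction are random here, for illustration only.
import torch

d_model, batch = 768, 8
torch.manual_seed(0)

activations = torch.randn(batch, d_model)        # e.g. residual stream vectors
feature_dir = torch.randn(d_model)
feature_dir = feature_dir / feature_dir.norm()   # unit-norm feature direction

feature_scores = activations @ feature_dir       # one scalar per example
print(feature_scores)
```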
Resources
Papers
- A Mathematical Framework for Transformer Circuits - Anthropic
- Toy Models of Superposition - Anthropic
- Scaling Monosemanticity - Anthropic
- Zoom In: An Introduction to Circuits - Distill
Courses & Tutorials
- ARENA Mechanistic Interpretability - Comprehensive hands-on course
- Neel Nanda's MI Tutorials - Excellent starting point
- TransformerLens Library - Python library for MI research
Blogs & Communities
- Transformer Circuits Thread - Anthropic's research blog
- AI Alignment Forum - Discussion and papers
- LessWrong MI Posts - Community research
TODO: Learning Path
- [ ] Read "A Mathematical Framework for Transformer Circuits" paper
- [ ] Complete ARENA Chapter 1: Transformer Interpretability
- [ ] Implement attention pattern visualization from scratch (see the first sketch after this list)
- [ ] Study induction heads and their role in in-context learning
- [ ] Explore the TransformerLens library with GPT-2 small
- [ ] Read "Toy Models of Superposition" paper
- [ ] Implement sparse autoencoders for feature extraction (see the second sketch after this list)
- [ ] Study Anthropic's dictionary learning approach
- [ ] Read "Scaling Monosemanticity" paper
- [ ] Contribute to open-source MI tools or replicate a finding
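Two of the items above lend themselves to small self-contained exercises. First, for attention pattern visualization from scratch, the essential computation is softmax(QKᵀ/√d_head) with a causal mask. The sketch below uses random Q and K rather than a real model's weights, so it only illustrates the shape of the computation.

```python
# A minimal sketch of computing a causal attention pattern "from scratch" for
# one head: pattern = softmax(Q K^T / sqrt(d_head)) with a causal mask.
# Q and K are random here; in a real exercise they would come from a model's
# query/key projections applied to the residual stream.
import math
import torch

seq_len, d_head = 10, 64
torch.manual_seed(0)

q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)

scores = q @ k.T / math.sqrt(d_head)                      # [seq_len, seq_len]

# Causal mask: position i may only attend to positions j <= i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

pattern = torch.softmax(scores, dim=-1)                   # rows sum to 1

# Crude text "visualization": print each row as rounded attention weights.
for i, row in enumerate(pattern):
    print(f"pos {i:2d}:", [round(w, 2) for w in row.tolist()])
```

Second, for sparse autoencoders, the basic shape used in this line of work is an overcomplete ReLU encoder, a linear decoder, and an L1 sparsity penalty on the hidden code. The sizes, L1 coefficient, and random stand-in activations below are illustrative, not a recipe from any particular paper.

```python
# A minimal sparse autoencoder sketch: reconstruct activations through an
# overcomplete ReLU bottleneck, with an L1 penalty encouraging each activation
# vector to be explained by only a few dictionary features.
# Sizes, the L1 coefficient, and the random "activations" are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(codes)           # reconstruction of the input
        return recon, codes

d_model, d_hidden, l1_coeff = 768, 4 * 768, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(100):
    acts = torch.randn(64, d_model)           # stand-in for cached model activations
    recon, codes = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```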
Current Research Directions
- Sparse Autoencoders - Decomposing activations into interpretable features
- Automated Circuit Discovery - Using algorithms to find circuits automatically
- Scaling Laws for Interpretability - How interpretation difficulty scales with model size
- Causal Interventions - Activation patching and causal tracing (see the sketch after this list)
- Cross-model Feature Universality - Do different models learn similar features?
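To make the causal-interventions entry concrete: activation patching runs the model on a corrupted prompt while splicing in an activation cached from a clean prompt, then checks how much of the clean behaviour returns. The sketch below assumes TransformerLens's hook API; the prompts, the patched layer, and the Paris-vs-Rome logit-difference metric are illustrative placeholders.

```python
# A minimal activation patching sketch (assuming TransformerLens's hook API):
# cache the residual stream on a clean prompt, then re-run a corrupted prompt
# while overwriting one layer's residual stream with the clean version.
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")

# Prompts chosen so they tokenize to the same length (an assumption worth checking).
clean_prompt = "The capital of France is"
corrupt_prompt = "The capital of Italy is"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

_, clean_cache = model.run_with_cache(clean_tokens)

layer = 6                                        # arbitrary layer to patch
hook_name = f"blocks.{layer}.hook_resid_pre"     # residual stream entering that layer

def patch_resid(resid, hook):
    # Overwrite the corrupted run's residual stream with the clean one.
    return clean_cache[hook_name]

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(hook_name, patch_resid)],
)

# Illustrative metric: does patching restore the clean answer (" Paris")?
paris = model.to_single_token(" Paris")
rome = model.to_single_token(" Rome")
logit_diff = patched_logits[0, -1, paris] - patched_logits[0, -1, rome]
print(f"patched logit diff (Paris - Rome): {logit_diff.item():.2f}")
```

Sweeping the patched layer (and position, or individual heads) and plotting the recovered logit difference is the usual way this technique is used to localize which components carry the relevant information.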
The field is rapidly evolving, with new techniques and discoveries emerging regularly. Stay curious and keep experimenting!