A local mirror of the AI Interpretability Wiki, built on MediaWiki and hosted on Miraheze.
Interpretability is the degree to which the decision processes and inner workings of artificial intelligence and machine learning systems can be understood by humans or other outside observers. [source]
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within AI interpretability and explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations. The approach seeks to analyze neural networks in a manner similar to how binary computer programs can be reverse-engineered to understand their functions. [source] "This can be contrasted to subfields of interpretability which seek to attribute some output to a part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification 'horse'." [source]
| Category | Description |
|---|---|
| Highlighted work | Notable and influential work in AI interpretability |
| Concepts | Core ideas and terminology |
| Methods | Techniques and approaches used to interpret models |
| Architectures for interpretability | Architectures used to interpret models |
| Applications | Practical uses and deployments |
| Phenomena | Behaviors and properties analyzed |
| Features and circuits | Features and circuits identified and analyzed in neural networks |
| Surveys | Survey papers and literature reviews |
| Theory | Theoretical foundations, mathematical and formal frameworks in AI and AI interpretability |
| Papers | Papers in AI interpretability |
| AI architectures | Architectures of studied models |
| Interpretability architectures | Architectures that are designed to be more interpretable |
| Github codebases | Code repositories |
| People and groups | Researchers, labs, and organizations |
| Communities | Communities on Discord, Slack, etc. |
| Feeds | Feeds for papers, blogs, newsletters, and other content |
| Youtube channels | Video content and educational channels |
| Resources for learning | Tutorials, courses, and learning materials |
| Events | Conferences, workshops, and meetups |
- Progress measures for grokking via mechanistic interpretability
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- On the Biology of a Large Language Model
- When Models Manipulate Manifolds: The Geometry of a Counting Task
- Feature
- Circuit
- Polysemanticity
- Monosemanticity
- Superposition
- Attribution graph
- Pragmatic interpretability
- Explainable AI
- Dictionary learning
- Sparse autoencoder
- Linear probe
- Cross-layer transcoder
- Transformer
- Convolutional neural network
- Alignment
- Grokking
- Jailbreak
- Addition
- Curve detector
- Dog detector
- Golden Gate Bridge Claude
- Chain of thought
- Feature
- Circuit
- Attribution graph
- Curve detector
- Dog detector
- A Mathematical Framework for Transformer Circuits
- The Principles of Deep Learning Theory
- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
- Singular learning theory
- Mechanistic Interpretability for AI Safety -- A Review
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- TransformerLens
- Neuronpedia
- SAELens
- Neel Nanda
- Anthropic
- OpenAI
- DeepMind
- GoodFire
- EleutherAI
- Mech Interp discord
- Alignment Forum
- Open Source Mechanistic Interpretability Slack
- EleutherAI
- Anthropic Interpretability research
- Transformer Circuits Thread
- Alignment Forum
- Open Source Mechanistic Interpretability Slack
- ARENA
- How to become a mechanistic interpretability researcher
- Progress measures for grokking via mechanistic interpretability
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- On the Biology of a Large Language Model
- When Models Manipulate Manifolds: The Geometry of a Counting Task
- To check which pages link to the page you're on, go to the Tools menu on the top right and click on "What links here".
- To enable dark mode, log in, open Preferences, go to Appearance, and select Dark. Alternatively, use the Dark Reader Chrome extension, although it may not render every page correctly.