
AI Interpretability Wiki

A local mirror of the AI Interpretability Wiki, built with MediaWiki and hosted on Miraheze.

About

Interpretability is the degree to which humans or other outside observers can understand the decision processes and inner workings of artificial intelligence and machine learning systems. [source]

Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within AI interpretability and explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations. The approach seeks to analyze neural networks in a manner similar to how binary computer programs can be reverse-engineered to understand their functions. [source] "This can be contrasted to subfields of interpretability which seek to attribute some output to a part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification 'horse'." [source]
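
The contrasted approach, input attribution, can be made concrete with a short example: rank input pixels by how strongly they influence a class score via gradient saliency. This is a minimal sketch assuming PyTorch and torchvision are available; the model choice, the random stand-in image, and the ImageNet class index are illustrative assumptions, not part of the wiki.

```python
import torch
from torchvision import models

# Minimal gradient-saliency sketch of input attribution: ask which input
# pixels the class score is most sensitive to. Model and class index are
# illustrative assumptions (339 is assumed to be ImageNet's "sorrel" horse).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image
horse_class = 339

score = model(image)[0, horse_class]  # scalar logit for the chosen class
score.backward()                      # populates image.grad

# Saliency map: per-pixel gradient magnitude, max over color channels.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)
```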

Contents

  • Highlighted work: Notable and influential work in AI interpretability
  • Concepts: Core ideas and terminology
  • Methods: Techniques and approaches used to interpret models
  • Architectures for interpretability: Architectures used to interpret models
  • Applications: Practical uses and deployments
  • Phenomena: Behaviors and properties analyzed
  • Features and circuits: Features and circuits identified and analyzed in neural networks
  • Surveys: Survey papers and literature reviews
  • Theory: Theoretical foundations and mathematical and formal frameworks in AI and AI interpretability
  • Papers: Papers in AI interpretability
  • AI architectures: Architectures of studied models
  • Interpretability architectures: Architectures designed to be more interpretable
  • GitHub codebases: Code repositories
  • People and groups: Researchers, labs, and organizations
  • Communities: Communities on Discord, Slack, and elsewhere
  • Feeds: Feeds for papers, blogs, and newsletters
  • YouTube channels: Video content and educational channels
  • Resources for learning: Tutorials, courses, and learning materials
  • Events: Conferences, workshops, and meetups

All pages

Highlighted work

  • Progress measures for grokking via mechanistic interpretability
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  • On the Biology of a Large Language Model
  • When Models Manipulate Manifolds: The Geometry of a Counting Task

Concepts

  • Feature
  • Circuit
  • Polysemanticity
  • Monosemanticity
  • Superposition
  • Attribution graph
  • Pragmatic interpretability
  • Explainable AI

Methods

  • Dictionary learning

Architectures for interpretability

  • Sparse autoencoder (see the sketch after this list)
  • Linear probe
  • Cross-layer transcoder
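
To make the "Dictionary learning" and "Sparse autoencoder" entries concrete, here is a minimal sparse-autoencoder sketch: it learns an overcomplete dictionary of features from model activations, with an L1 penalty encouraging sparsity. The dimensions, penalty weight, and stand-in activations are illustrative assumptions, not the setup of any particular paper.

```python
import torch
import torch.nn as nn

# Minimal sparse-autoencoder sketch for dictionary learning on activations:
# encode d_model-dimensional activations into a larger (overcomplete) set of
# d_dict nonnegative feature activations, then linearly reconstruct.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse features
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)

# Reconstruction loss plus an L1 sparsity penalty on feature activations;
# the 1e-3 coefficient is an illustrative assumption.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().sum(dim=-1).mean()
```

The ReLU encoder keeps feature activations nonnegative and the L1 term pushes most of them toward zero, which is what lets individual dictionary directions behave as sparse, more monosemantic features.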

AI architectures

  • Transformer
  • Convolutional neural network

Phenomena

  • Alignment
  • Grokking
  • Jailbreak
  • Addition
  • Curve detector
  • Dog detector
  • Golden Gate Bridge Claude
  • Chain of thought

Features and circuits

  • Feature
  • Circuit
  • Attribution graph
  • Curve detector
  • Dog detector

Theory

  • A Mathematical Framework for Transformer Circuits
  • The Principles of Deep Learning Theory
  • Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
  • Singular learning theory

Surveys

  • Mechanistic Interpretability for AI Safety -- A Review
  • Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  • An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

GitHub codebases

  • TransformerLens
  • Neuronpedia
  • SAELens

People and groups

  • Neel Nanda
  • Anthropic
  • OpenAI
  • DeepMind
  • Goodfire
  • EleutherAI

Communities

  • Mech Interp discord
  • Alignment Forum
  • Open Source Mechanistic Interpretability Slack
  • EleutherAI

Feeds

  • Anthropic Interpretability research
  • Transformer Circuits Thread
  • Alignment Forum
  • Open Source Mechanistic Interpretability Slack

Resources for learning

  • ARENA
  • How to become a mechanistic interpretability researcher

Papers

  • Progress measures for grokking via mechanistic interpretability
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  • On the Biology of a Large Language Model
  • When Models Manipulate Manifolds: The Geometry of a Counting Task

Random page

Click here to go to a random page on the wiki!

Tips

  • To check which pages link to the page you're on, go to the Tools menu on the top right and click on "What links here".
  • To enable dark mode, log in, go to Preferences, Appearance, and select Dark. Alternatively, use the Dark Reader Chrome extension, but it might be a bit broken.

Wiki created by
