
AI Interpretability Wiki

A local mirror of the AI Interpretability Wiki, built with MediaWiki and hosted on Miraheze.

About

Interpretability is the degree to which humans or other outside observers can understand the decision processes and inner workings of artificial intelligence and machine learning systems. [source]

Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within AI interpretability and explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations. The approach seeks to analyze neural networks in a manner similar to how binary computer programs can be reverse-engineered to understand their functions. [source] "This can be contrasted to subfields of interpretability which seek to attribute some output to a part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification 'horse'." [source]
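
The contrasted approach, input attribution, can be made concrete with a short example: rank input pixels by how strongly they influence a class score via gradient saliency. This is a minimal sketch assuming PyTorch and torchvision are available; the model choice, the random stand-in image, and the ImageNet class index are illustrative assumptions, not part of the wiki.

```python
import torch
from torchvision import models

# Minimal gradient-saliency sketch of input attribution: ask which input
# pixels the class score is most sensitive to. Model and class index are
# illustrative assumptions (339 is assumed to be ImageNet's "sorrel" horse).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image
horse_class = 339

score = model(image)[0, horse_class]  # scalar logit for the chosen class
score.backward()                      # populates image.grad

# Saliency map: per-pixel gradient magnitude, max over color channels.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)
```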

Contents

  • Highlighted work: Notable and influential work in AI interpretability
  • Concepts: Core ideas and terminology
  • Methods: Techniques and approaches used to interpret models
  • Architectures for interpretability: Architectures used to interpret models
  • Applications: Practical uses and deployments
  • Phenomena: Behaviors and properties analyzed
  • Features and circuits: Features and circuits identified and analyzed in neural networks
  • Surveys: Survey papers and literature reviews
  • Theory: Theoretical foundations and mathematical and formal frameworks in AI and AI interpretability
  • Papers: Papers in AI interpretability
  • AI architectures: Architectures of studied models
  • Interpretability architectures: Architectures designed to be more interpretable
  • GitHub codebases: Code repositories
  • People and groups: Researchers, labs, and organizations
  • Communities: Communities on Discord, Slack, and elsewhere
  • Feeds: Feeds for papers, blogs, and newsletters
  • YouTube channels: Video content and educational channels
  • Resources for learning: Tutorials, courses, and learning materials
  • Events: Conferences, workshops, and meetups

All pages

Highlighted work

  • Progress measures for grokking via mechanistic interpretability
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  • On the Biology of a Large Language Model
  • When Models Manipulate Manifolds: The Geometry of a Counting Task

Concepts

  • Feature
  • Circuit
  • Polysemanticity
  • Monosemanticity
  • Superposition
  • Attribution graph
  • Pragmatic interpretability
  • Explainable AI

Methods

  • Dictionary learning

Architectures for interpretability

  • Sparse autoencoder (see the sketch after this list)
  • Linear probe
  • Cross-layer transcoder
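
To make the "Dictionary learning" and "Sparse autoencoder" entries concrete, here is a minimal sparse-autoencoder sketch: it learns an overcomplete dictionary of features from model activations, with an L1 penalty encouraging sparsity. The dimensions, penalty weight, and stand-in activations are illustrative assumptions, not the setup of any particular paper.

```python
import torch
import torch.nn as nn

# Minimal sparse-autoencoder sketch for dictionary learning on activations:
# encode d_model-dimensional activations into a larger (overcomplete) set of
# d_dict nonnegative feature activations, then linearly reconstruct.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse features
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)

# Reconstruction loss plus an L1 sparsity penalty on feature activations;
# the 1e-3 coefficient is an illustrative assumption.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().sum(dim=-1).mean()
```

The ReLU encoder keeps feature activations nonnegative and the L1 term pushes most of them toward zero, which is what lets individual dictionary directions behave as sparse, more monosemantic features.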

AI architectures

  • Transformer
  • Convolutional neural network

Phenomena

  • Alignment
  • Grokking
  • Jailbreak
  • Addition
  • Curve detector
  • Dog detector
  • Golden Gate Bridge Claude
  • Chain of thought

Features and circuits

  • Feature
  • Circuit
  • Attribution graph
  • Curve detector
  • Dog detector

Theory

  • A Mathematical Framework for Transformer Circuits
  • The Principles of Deep Learning Theory
  • Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
  • Singular learning theory

Surveys

  • Mechanistic Interpretability for AI Safety -- A Review
  • Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  • An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

GitHub codebases

  • TransformerLens
  • Neuronpedia
  • SAELens

People and groups

  • Neel Nanda
  • Anthropic
  • OpenAI
  • DeepMind
  • Goodfire
  • EleutherAI

Communities

  • Mech Interp discord
  • Alignment Forum
  • Open Source Mechanistic Interpretability Slack
  • EleutherAI

Feeds

  • Anthropic Interpretability research
  • Transformer Circuits Thread
  • Alignment Forum
  • Open Source Mechanistic Interpretability Slack

Resources for learning

  • ARENA
  • How to become a mechanistic interpretability researcher

Papers

  • Progress measures for grokking via mechanistic interpretability
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  • On the Biology of a Large Language Model
  • When Models Manipulate Manifolds: The Geometry of a Counting Task

Random page

Click here to go to a random page on the wiki!

Tips

  • To check which pages link to the page you're on, go to the Tools menu on the top right and click on "What links here".
  • To enable dark mode, log in, go to Preferences, Appearance, and select Dark. Alternatively, use the Dark Reader Chrome extension, but it might be a bit broken.

Wiki created by
