🔎 Mechanistic Interpretability for Understanding Language Abilities in Large Language Models: A Survey
We will continue to update this repository.
If you find this project useful, a star ⭐ on GitHub would be greatly appreciated and will keep you informed about future updates.
- MIB: A Mechanistic Interpretability Benchmark
- AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
- Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
- SyntaxGym: An Online Platform for Targeted Evaluation of Language Models
- ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
- Probing Syntax in Large Language Models: Successes and Remaining Challenges
- Probing Internal Representations of Multi-Word Verbs in Large Language Models
- Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data?
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
- Probing Classifiers: Promises, Shortcomings, and Advances
- First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
- Finding Universal Grammatical Relations in Multilingual BERT
- On the Language Neutrality of Pre-trained Multilingual Representations
- Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-supervision
- A Structural Probe for Finding Syntax in Word Representations
- Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- Interpreting GPT: the logit lens
- Unraveling Syntax: How Language Models Learn Context-Free Grammars
- The Semantic Hub Hypothesis: Language Models Share Semantic Representations
- Do Multilingual LLMs Think in English?
- Do Llamas Work in English? On the Latent Language of Multilingual Transformers
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- LinguaLens: Towards Interpreting Linguistic Mechanisms via Sparse Auto-Encoder
- SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
- On the Theoretical Foundation of Sparse Dictionary Learning
- Crosscoding Through Time: Tracking Emergence of Linguistic Representations
- Large Language Models Share Representations of Latent Grammatical Concepts
- Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
- Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages
- Unveiling Language-Specific Features via Sparse Autoencoders
- Analyzing Multilingualism in Large Language Models with Sparse Autoencoders
- Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders
- Causal Language Control in Multilingual Transformers via Sparse Feature Steering
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs
- Extended Abstract for “Linguistic Universals”: Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Transcoders Find Interpretable LLM Feature Circuits
- How to use and interpret activation patching
- Is This the Subspace You Are Looking for? An Interpretability Illusion
- Towards Best Practices of Activation Patching in Language Models
- Function Vectors in Large Language Models
- Attribution Patching Outperforms Automated Circuit Discovery
- Attribution Patching: Activation Patching At Industrial Scale
- Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
- The Dual-Route Model of Induction
- LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation
- Task-Specific Skill Localization in Fine-tuned Language Models
- Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
- Cross-Lingual Generalization and Compression
- How Syntax Specialization Emerges in Language Models
- Language Lives in Sparse Dimensions: Interpretable Multilingual Control
- Inducing Dyslexia in Vision Language Models
- Different types of syntactic agreement recruit the same units
- Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models
- Multilingual Knowledge Editing with Language-Agnostic Factual Neurons
- From Language to Cognition: How LLMs Outgrow the Human Language Network
- The Transfer Neurons Hypothesis
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
- The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model
- The LLM Language Network: A Neuroscientific Approach
- Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer
- Language-Specific Neurons: The Key to Multilingual Capabilities
- Unveiling Linguistic Regions in Large Language Models
- Converging to a Lingua Franca: Evolution of Linguistic Regions
- Unveiling Language Competence Neurons: A Psycholinguistic Approach
- Linguistic Minimal Pairs Elicit Linguistic Similarity
- Neuron-Level Knowledge Attribution in Large Language Models
- Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation
- Decoding Probing: Revealing Internal Linguistic Structures
- On the Multilingual Ability: Finding and Controlling Language-Specific Neurons
- How do Large Language Models Handle Multilingualism?
- Universal Neurons in GPT2 Language Models
- Same Neurons, Different Languages
- Importance-based Neuron Allocation for Multilingual Neural Machine Translation
- Circuit Tracing: Revealing Computational Graphs in Language Models
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Scaling Sparse Feature Circuits For Studying In-Context Learning
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages
- The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling
- Circuit Component Reuse Across Tasks in Transformer Language Models
Feel free to open an issue or contact us if you have any questions or would like your work included in this list!
Authors: Xufeng Duan (xufengduan@cuhk.edu.hk) and Zhaoqian Yao (zhaoqianyao@cuhk.edu.hk)
