🔎 Mechanistic Interpretability for Understanding Language Abilities in Large Language Models: A Survey
We will continue to update this repository.
If you find this project useful, a star ⭐ on GitHub would be greatly appreciated and will keep you informed about future updates.
- MIB: A Mechanistic Interpretability Benchmark
- AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
- Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
- SyntaxGym: An Online Platform for Targeted Evaluation of Language Models
- ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
- Probing Syntax in Large Language Models: Successes and Remaining Challenges
- Probing Internal Representations of Multi-Word Verbs in Large Language Models
- Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data?
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
- Probing Classifiers: Promises, Shortcomings, and Advances
- First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
- Finding Universal Grammatical Relations in Multilingual BERT
- On the Language Neutrality of Pre-trained Multilingual Representations
- Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-supervision
- A Structural Probe for Finding Syntax in Word Representations
- Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- Interpreting GPT: the logit lens
- Unraveling Syntax: How Language Models Learn Context-Free Grammars
- The Semantic Hub Hypothesis: Language Models Share Semantic Representations
- Do Multilingual LLMs Think in English?
- Do Llamas Work in English? On the Latent Language of Multilingual Transformers
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- LinguaLens: Towards Interpreting Linguistic Mechanisms via Sparse Auto-Encoder
- SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
- On the Theoretical Foundation of Sparse Dictionary Learning
- Crosscoding Through Time: Tracking Emergence of Linguistic Representations
- Large Language Models Share Representations of Latent Grammatical Concepts
- Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
- Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages
- Unveiling Language-Specific Features via Sparse Autoencoders
- Analyzing Multilingualism in Large Language Models with Sparse Autoencoders
- Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders
- Causal Language Control in Multilingual Transformers via Sparse Feature Steering
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs
- Extended Abstract for “Linguistic Universals”: Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Transcoders Find Interpretable LLM Feature Circuits
- How to use and interpret activation patching
- Is This the Subspace You Are Looking for? An Interpretability Illusion
- Towards Best Practices of Activation Patching in Language Models
- Function Vectors in Large Language Models
- Attribution Patching Outperforms Automated Circuit Discovery
- Attribution Patching: Activation Patching At Industrial Scale
- Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
- The Dual-Route Model of Induction
- LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation
- Task-Specific Skill Localization in Fine-tuned Language Models
- Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
- Cross-Lingual Generalization and Compression
- How Syntax Specialization Emerges in Language Models
- Language Lives in Sparse Dimensions: Interpretable Multilingual Control
- Inducing Dyslexia in Vision Language Models
- Different types of syntactic agreement recruit the same units
- Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models
- Multilingual Knowledge Editing with Language-Agnostic Factual Neurons
- From Language to Cognition: How LLMs Outgrow the Human Language Network
- The Transfer Neurons Hypothesis
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
- The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model
- The LLM Language Network: A Neuroscientific Approach
- Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer
- Language-Specific Neurons: The Key to Multilingual Capabilities
- Unveiling Linguistic Regions in Large Language Models
- Converging to a Lingua Franca: Evolution of Linguistic Regions
- Unveiling Language Competence Neurons: A Psycholinguistic Approach
- Linguistic Minimal Pairs Elicit Linguistic Similarity
- Neuron-Level Knowledge Attribution in Large Language Models
- Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation
- Decoding Probing: Revealing Internal Linguistic Structures
- On the Multilingual Ability: Finding and Controlling Language-Specific Neurons
- How do Large Language Models Handle Multilingualism?
- Universal Neurons in GPT2 Language Models
- Same Neurons, Different Languages
- Importance-based Neuron Allocation for Multilingual Neural Machine Translation
- Circuit Tracing: Revealing Computational Graphs in Language Models
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Scaling Sparse Feature Circuits For Studying In-Context Learning
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages
- The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling
- Circuit Component Reuse Across Tasks in Transformer Language Models
Feel free to open an issue or contact us if you have any questions or would like your work included in this list!
Authors: Xufeng Duan (xufengduan@cuhk.edu.hk) and Zhaoqian Yao (zhaoqianyao@cuhk.edu.hk)
