In this codebook, we perform a real causal interpretability experiment and a full mechanistic interpretability analysis.

himanshuvnm/Mechanistic_Artificial_Intelligence_Interpretability


Mechanistic Artificial Intelligence Interpretability

This repository contains a mechanistic interpretability study of transformer models using causal activation patching and circuit localization. The project applies intervention-based methods to reverse-engineer how GPT-style transformers perform relational reasoning tasks.

The core goal is to treat neural networks as computational systems and uncover the internal mechanisms and information flow that give rise to reasoning behavior.


🔬 Project Overview

This project implements a full causal interpretability pipeline using TransformerLens:

  • Construction of clean and corrupted reasoning prompts
  • Definition of a causal decision metric
  • Residual stream activation patching across layers
  • Localization of reasoning circuits
  • Attention head analysis and information flow visualization

The primary experiment studies indirect-object reasoning (e.g., "A gave the book to B. Therefore, ___") and identifies which transformer layers causally implement the relational inference.

Rather than relying on correlation analysis, this work uses causal interventions to localize the computation responsible for model decisions.
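As a concrete illustration, the clean/corrupted prompt pair and the decision metric might be set up as follows. This is a minimal sketch: the prompt wording, token choices, and function names are illustrative, not the repository's exact code.

```python
import torch

# Illustrative clean/corrupted pair: the prompts differ only in which
# entity receives the object, so any change in the model's preferred
# answer is attributable to that relational fact.
clean_prompt = "John gave the book to Mary. Therefore, the book belongs to"
corrupt_prompt = "Mary gave the book to John. Therefore, the book belongs to"

def logit_diff(logits: torch.Tensor, correct_id: int, wrong_id: int) -> float:
    """Decision metric: logit of the correct recipient minus the logit of
    the distractor, read at the final token position."""
    final = logits[0, -1]  # next-token logits, shape [d_vocab]
    return (final[correct_id] - final[wrong_id]).item()

# Assumed TransformerLens usage (real API, illustrative variable names):
#   from transformer_lens import HookedTransformer
#   model = HookedTransformer.from_pretrained("gpt2")
#   correct_id = model.to_single_token(" Mary")
#   wrong_id = model.to_single_token(" John")
#   baseline = logit_diff(model(model.to_tokens(clean_prompt)), correct_id, wrong_id)
```

A positive `logit_diff` means the model favors the correct recipient; the gap between its value on the clean and corrupted prompts is what the patching experiments below try to recover.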


🧠 Key Results

Using activation patching on GPT-2-small:

  • Localized the reasoning circuit to mid-to-late transformer layers (layers 8–10)
  • Demonstrated causal dependence of output behavior on internal residual stream representations
  • Identified the layer region where relational information is stored and transformed
  • Produced layer-wise recovery curves and attention heatmaps

These results replicate known transformer reasoning phenomena and demonstrate a complete mechanistic analysis workflow.


📊 Methods

  • TransformerLens (HookedTransformer)
  • Residual stream activation patching
  • Causal intervention on internal activations
  • Layer-wise circuit localization
  • Attention head inspection and visualization
  • Decision-metric based evaluation
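Putting these methods together, the layer-wise residual-stream patching loop can be sketched roughly as below. A hedged sketch: `model` is assumed to be a `HookedTransformer`, `metric` a decision metric such as the logit difference, and the hook-name string matches TransformerLens's `utils.get_act_name("resid_pre", layer)` convention; function and variable names are illustrative.

```python
import torch

def make_patch_hook(clean_cache):
    """Build a forward hook that overwrites the corrupted run's activation
    with the cached clean activation at the same hook point."""
    def patch_hook(activation, hook):
        # Returning a tensor from a TransformerLens hook replaces the activation.
        return clean_cache[hook.name]
    return patch_hook

def resid_patching_curve(model, clean_tokens, corrupt_tokens, metric):
    """For each layer, rerun the corrupted prompt with the clean residual
    stream spliced in at that layer, and record the decision metric."""
    _, clean_cache = model.run_with_cache(clean_tokens)
    curve = []
    for layer in range(model.cfg.n_layers):
        hook_name = f"blocks.{layer}.hook_resid_pre"
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, make_patch_hook(clean_cache))],
        )
        curve.append(metric(patched_logits))
    return curve
```

Plotting `curve` against the layer index yields a layer-wise patching curve; attention patterns for head-level inspection can similarly be read out of the same cache (e.g. `clean_cache["blocks.8.attn.hook_pattern"]`).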

In the following figure, the x-axis shows the transformer layer index (0–11) and the y-axis plots the decision metric; both the dashed line and the solid curve are clearly visible. Interpretation by layer:

| Layer Index | Interpretation |
| --- | --- |
| 0–3 | Weak influence (mostly syntax & token structure) |
| 4–6 | Partial influence (entity tracking starts) |
| 7–9 | Strong influence (core reasoning happens here) |
| 10 | Strong influence |
| 11 | Weak again (logit-formatting layer) |

In the figure “Activation Patching: Which layers matter most?”, the curve dips strongly at layers 8, 9, and 10, meaning these are the layers where the model computes “who received the object”. This matches the well-known division of labor in transformers: early layers = form; middle layers = meaning; late layers = decision.

[Figure: Activation Patching: Which layers matter most?]

The following figure, “Normalized Recovery by Layer”, supports the interpretations below:

| Layer Index | Recovery | Interpretation |
| --- | --- | --- |
| 0–2 | ≈ 0.4 | Slight contribution |
| 3–5 | ≈ 0.5–0.7 | Partial reasoning |
| 6–7 | ≈ 0.8 | Important |
| 8 | ≈ 1.1 | Core reasoning |
| 9 | ≈ 1.2 | Core reasoning |
| 10 | ≈ 1.15 | Core reasoning |
| 11 | ≈ 0.33 | Mostly formatting |
[Figure: Normalized Recovery by Layer]

The causal reasoning circuit lives primarily in layers 8–10.
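The recovery values above are consistent with the standard normalization used in activation-patching work, where the patched metric is rescaled by the clean-vs-corrupted gap. A minimal sketch (the function name is illustrative):

```python
def normalized_recovery(patched: float, corrupt_baseline: float,
                        clean_baseline: float) -> float:
    """Fraction of the clean-vs-corrupted metric gap restored by a patch.
    0.0 = no effect, 1.0 = full recovery of clean behavior; values above
    1.0 (as at layers 8-10 above) mean the patch overshoots the clean run."""
    return (patched - corrupt_baseline) / (clean_baseline - corrupt_baseline)
```

For example, if the clean prompt scores 4.0 on the decision metric, the corrupted prompt 0.0, and patching a layer brings the corrupted run up to 2.0, that layer's normalized recovery is 0.5.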


Research Themes

  • Mechanistic interpretability
  • Causal analysis of transformers
  • Neural circuit discovery
  • Alignment & transparency
