Commit 924e82c

debugging
1 parent 6ee75b9 commit 924e82c

1 file changed: +14 -8 lines changed


_posts/2025-09-12-debugging-numeric-comparisons-llms.md

Lines changed: 14 additions & 8 deletions
@@ -65,7 +65,7 @@ LLMs can ace complex reasoning yet still fail at simple numeric comparisons like
### 4) Causal edits: last-layer MLP corrupts the decision

- **What**: Activation patching from **L24→L25** at multiple hooks using TransformerLens.
+ **What**: Activation patching from **L24→L25** at multiple hooks using TransformerLens [Nanda & Bloom, 2022](#references).

**Finding**: Patching `mlp_post`/`resid_post` **improves** accuracy; other patches often hurt.

**Why**: This isolates the **final-block MLP** as the primary corruption source.

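As a rough illustration of the patching described in this hunk, here is a minimal TransformerLens sketch: cache activations from a clean run and splice them into a corrupted run at a single hook. The checkpoint name, prompts, and the choice of layer-25 hook point are assumptions for illustration only; the post's own L24→L25, multi-hook setup is only approximated here.

```python
# Minimal activation-patching sketch with TransformerLens (assumed setup, not the post's code).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b-it")  # assumed checkpoint id

# Illustrative prompts; for direct patching they should tokenize to the same length.
clean_prompt = "Is 9.11 bigger than 9.80? Answer Yes or No:"
corrupt_prompt = "Is 9.80 bigger than 9.11? Answer Yes or No:"

# Cache every activation from the "clean" run.
_, clean_cache = model.run_with_cache(clean_prompt)

hook_name = "blocks.25.hook_resid_post"  # or e.g. "blocks.25.mlp.hook_post"

def patch_hook(activation, hook):
    # Replace the corrupted-run activation with the cached clean one.
    return clean_cache[hook.name]

# Run the corrupted prompt with the clean activation patched in at one hook point.
patched_logits = model.run_with_hooks(corrupt_prompt, fwd_hooks=[(hook_name, patch_hook)])
```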
@@ -82,7 +82,7 @@ LLMs can ace complex reasoning yet still fail at simple numeric comparisons like
## Bottom line

- Gemma-2-2B-IT internally represents the correct comparator but the last-layer MLP (readout) introduces a Yes-biased corruption. Simple, principled interventions—patching or ablating ~50 neurons—substantially reduce errors, and diagnostics suggest a lingering length heuristic distinct from true value comparison.
+ Gemma-2-2B-IT internally represents the correct comparator, but the last-layer MLP (readout) introduces a Yes-biased corruption. Simple, principled interventions—patching or ablating ~50 neurons—substantially reduce errors, and diagnostics suggest a lingering length heuristic distinct from true value comparison.

@@ -200,7 +200,9 @@ To understand the geometry of internal model's representation we conducted a PCA
PCA was fit on each dataset's activations. Activations from the other datasets were then projected into this learned PCA space.
This lets us ask: does a separating axis from one dataset also reveal structure in another? The answer is **yes**.

- Calculated cosine similarity between **Yes-No mean difference** axes in source vs target representations.
+ We then calculated the cosine similarity between the **Yes-No mean difference** ([Appendix B](#appendix-b)) axes in the source vs. target representations.

### Cross-dataset alignment (cosine similarity)

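To make the cross-dataset check concrete, the sketch below fits PCA on one dataset's activations, projects a second dataset into that learned space, and compares the two Yes-No mean-difference axes by cosine similarity. The arrays, label encoding, and toy dimensions are placeholders, not the post's cached activations.

```python
# Hedged sketch of the cross-dataset PCA / axis-alignment check (placeholder data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder activations standing in for cached last-layer residuals: [n_examples, d_model].
acts_a, acts_b = rng.normal(size=(200, 64)), rng.normal(size=(200, 64))
labels_a, labels_b = rng.integers(0, 2, 200), rng.integers(0, 2, 200)  # 1 = Yes, 0 = No

def yes_no_axis(acts, labels):
    """Mean activation of Yes examples minus mean activation of No examples."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pca = PCA(n_components=2).fit(acts_a)                           # fit PCA on the source dataset only
proj_a, proj_b = pca.transform(acts_a), pca.transform(acts_b)   # project both datasets into that space

alignment = cosine(yes_no_axis(proj_a, labels_a), yes_no_axis(proj_b, labels_b))
print(f"cross-dataset Yes-No axis cosine similarity: {alignment:.3f}")
```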
@@ -336,14 +338,14 @@ Patching the output of the MLP sub-block (resid_post, mlp_post) reliably improve
**Methodology**:

- 1. Discovery: First, I identified the most "harmful" neurons by scoring their negative impact on accuracy across the entire dataset. Neurons were ranked based on a gradient-based method that measures how much their activation contributes to pushing the final decision in the wrong direction (see Appendix C for the mathematical details).
+ 1. Discovery: First, I identified the most "harmful" neurons by scoring their negative impact on accuracy across the entire dataset. Neurons were ranked based on a gradient-based method that measures how much their activation contributes to pushing the final decision in the wrong direction (see [Appendix C](#appendix-c) for the mathematical details).

2. Verification: To ensure these findings weren't just an artifact of overfitting to the test data, I repeated the experiment with a formal train/validation split. The harmful neurons were identified using only the training data, and then ablated to measure the performance change on the held-out validation data.

Score neurons by their negative impact on accuracy (per dataset and globally). Note that here we use the full data for both training and prediction, with no splits; the next section uses a train/validation split, and the results remain similar.
- Appendix C. describes the mathematical intuition.
+ [Appendix C](#appendix-c) describes the mathematical intuition.

The per-neuron score is h_j element-wise multiplied by (W_out[j] · g), where g is the gradient, with respect to r_post, of the dot product between the layer-norm projection and the Yes/No direction; W_out[j] is the output-weight row of the j-th neuron, and h_j is that neuron's activation.
We then align this score with the truth_yn labels (+1 if Yes, else -1).
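A toy sketch of that scoring rule follows, on placeholder tensors; the layer norm is omitted and every name and shape is an assumption, so it shows the mechanics rather than reproducing the post's pipeline.

```python
# Hedged toy sketch of the harmful-neuron score: score_j = h_j * (W_out[j] . g),
# aligned with the ground-truth sign. Placeholder tensors; layer norm omitted.
import torch

torch.manual_seed(0)
d_model, d_mlp, n_examples = 64, 256, 100           # toy sizes, not Gemma's real dims

W_out = torch.randn(d_mlp, d_model)                 # last-block MLP output weights
h = torch.randn(n_examples, d_mlp)                  # per-example neuron activations h_j
yes_no_dir = torch.randn(d_model)                   # Yes-No readout direction (Appendix B)
truth = torch.randint(0, 2, (n_examples,)).float() * 2 - 1   # +1 if Yes, -1 if No

r_post = (h @ W_out).detach().requires_grad_(True)  # MLP contribution to the residual stream
proj = (r_post @ yes_no_dir).sum()                  # projection onto the Yes/No direction
g = torch.autograd.grad(proj, r_post)[0]            # g = d proj / d r_post, per example

# score[i, j] = h_j * (W_out[j] . g_i), aligned with truth; more negative = more harmful.
scores = (h * (g @ W_out.T)) * truth[:, None]
most_harmful = scores.mean(dim=0).argsort()         # ascending: most harmful neurons first
print(most_harmful[:10])
```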
@@ -413,7 +415,7 @@ Then Integer_diff_len = 22
then tied integer_equal_len and decimal_equal_len = 27

- ### B. What is the Yes–No mean difference?
+ ### B. What is the Yes–No mean difference? {#appendix-b}

Let each example i have:
- activation vector: h_i (dimension d)
@@ -439,7 +441,7 @@ Signed score of any activation h along this axis
Separation magnitude along the axis
- separation = | E[score(h) | y=Yes] − E[score(h) | y=No] |

- ### C. Harmful Neuron Finding
+ ### C. Harmful Neuron Finding {#appendix-c}

#### How do I rank the neurons from most harmful to least harmful?
- We focus on the last transformer block, whose output then goes to ln_head.
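For concreteness, a tiny sketch of the signed score and the separation magnitude defined just above, on assumed numpy arrays with labels encoded as 1 = Yes, 0 = No (names are placeholders):

```python
# Hedged sketch: signed score along the Yes-No axis and the separation magnitude.
import numpy as np

def signed_score(h, axis):
    """Project activation(s) h onto the unit-normalized Yes-No mean-difference axis."""
    return h @ (axis / np.linalg.norm(axis))

def separation(acts, labels, axis):
    """| E[score | Yes] - E[score | No] | over a dataset with labels 1 = Yes, 0 = No."""
    s = signed_score(acts, axis)
    return abs(s[labels == 1].mean() - s[labels == 0].mean())
```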
@@ -481,7 +483,11 @@ Basically trying out to see, h_j how strongly neuron j is firing and gradient pr
The next step is to multiply it by (+1 if Yes, else -1) to align it with the truth.

- ## References
+ ## References {#references}
+
+ - Nanda, N., & Bloom, J. (2022). *TransformerLens*. GitHub repository. https://github.com/TransformerLensOrg/TransformerLens
+ - Alain, G., & Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644. https://arxiv.org/abs/1610.01644

## Disclaimer
