_posts/2025-09-12-debugging-numeric-comparisons-llms.md (14 additions, 8 deletions)
@@ -65,7 +65,7 @@ LLMs can ace complex reasoning yet still fail at simple numeric comparisons like
### 4) Causal edits: last-layer MLP corrupts the decision
- **What**: Activation patching from **L24→L25** at multiple hooks using TransformerLens.
+ **What**: Activation patching from **L24→L25** at multiple hooks using TransformerLens [Nanda & Bloom, 2022](#references).
**Finding**: Patching `mlp_post`/`resid_post` **improves** accuracy; other patches often hurt.
**Why**: This isolates the **final-block MLP** as the primary corruption source.
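To make the patching step concrete, here is a minimal sketch of an L24→L25 patch in TransformerLens; the prompt, model alias, and hook name are illustrative assumptions rather than the post's exact code.

```python
# Minimal L24→L25 activation-patching sketch with TransformerLens.
# The prompt, model alias, and hook choice are illustrative assumptions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b-it")
prompt = "Is 9.11 bigger than 9.9? Answer Yes or No:"  # hypothetical prompt

# Cache all activations once so layer 24's output can be reused as the donor.
_, cache = model.run_with_cache(prompt)

def patch_from_l24(activation, hook):
    # Replace the layer-25 activation with the cached layer-24 value.
    activation[:] = cache[hook.name.replace("blocks.25", "blocks.24")]
    return activation

patched_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.25.hook_resid_post", patch_from_l24)],
)

yes_id, no_id = model.to_single_token(" Yes"), model.to_single_token(" No")
gap = patched_logits[0, -1, yes_id] - patched_logits[0, -1, no_id]
print("Yes-No logit gap after patching:", gap.item())
```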
@@ -82,7 +82,7 @@ LLMs can ace complex reasoning yet still fail at simple numeric comparisons like
## Bottom line
- Gemma-2-2B-IT internally represents the correct comparator but the last-layer MLP (readout) introduces a Yes-biased corruption. Simple, principled interventions—patching or ablating ~50 neurons—substantially reduce errors, and diagnostics suggest a lingering length heuristic distinct from true value comparison.
+ Gemma-2-2B-IT internally represents the correct comparator but the last-layer MLP (readout) introduces a Yes-biased corruption. Simple, principled interventions—patching or ablating ~50 neurons—substantially reduce errors, and diagnostics suggest a lingering length heuristic distinct from true value comparison.
@@ -200,7 +200,9 @@ To understand the geometry of internal model's representation we conducted a PCA
PCA was fit on each dataset's activations. Activations from other datasets were then projected into this learned PCA space.
This lets us ask: does a separating axis from one dataset also reveal structure in another? The answer is yes.
- Calculated cosine similarity between **Yes-No mean difference** axes in source vs target representations.
+ Calculated cosine similarity between the **Yes-No mean difference** ([Appendix B](#appendix-b)) axes in source vs target representations.
+
+
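As a rough illustration of this cross-dataset check, here is a sketch; the arrays, label convention, and PCA dimensionality are hypothetical placeholders, not the post's actual data.

```python
# Sketch: fit PCA on the source dataset, project the target dataset into that space,
# and compare the Yes-No mean-difference axes via cosine similarity.
# All arrays and names here are illustrative placeholders.
import numpy as np
from sklearn.decomposition import PCA

def yes_no_axis(acts, labels):
    """Mean activation of Yes examples minus mean activation of No examples (unit norm)."""
    axis = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return axis / np.linalg.norm(axis)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
# Placeholder activations standing in for cached model activations on two datasets.
acts_src, labels_src = rng.normal(size=(200, 256)), rng.integers(0, 2, 200)
acts_tgt, labels_tgt = rng.normal(size=(200, 256)), rng.integers(0, 2, 200)

pca = PCA(n_components=10).fit(acts_src)   # fit on the source dataset only
src_proj, tgt_proj = pca.transform(acts_src), pca.transform(acts_tgt)

alignment = cosine(yes_no_axis(src_proj, labels_src), yes_no_axis(tgt_proj, labels_tgt))
print(f"cross-dataset Yes-No axis alignment: {alignment:.3f}")
```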
### Cross-dataset alignment (cosine similarity)
@@ -336,14 +338,14 @@ Patching the output of the MLP sub-block (resid_post, mlp_post) reliably improve
**Methodology**:
- 1. Discovery: First, I identified the most "harmful" neurons by scoring their negative impact on accuracy across the entire dataset. Neurons were ranked based on a gradient-based method that measures how much their activation contributes to pushing the final decision in the wrong direction (see Appendix C for the mathematical details).
+ 1. Discovery: First, I identified the most "harmful" neurons by scoring their negative impact on accuracy across the entire dataset. Neurons were ranked based on a gradient-based method that measures how much their activation contributes to pushing the final decision in the wrong direction (see [Appendix C](#appendix-c) for the mathematical details).
2. Verification: To ensure these findings weren't just an artifact of overfitting to the test data, I repeated the experiment with a formal train/validation split. The harmful neurons were identified using only the training data, and then ablated to measure the performance change on the held-out validation data.
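A minimal sketch of this verification step follows; the model alias, hook name, neuron indices, validation prompts, and the choice of zero-ablation are all assumptions made for illustration.

```python
# Sketch: zero-ablate previously identified "harmful" final-block MLP neurons and
# measure accuracy on held-out validation prompts. Indices, prompts, hook name,
# and zero-ablation are illustrative assumptions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b-it")
harmful_neurons = [17, 231, 980]  # hypothetical indices scored on the training split

def ablate_hook(activation, hook):
    # Zero the selected neurons of the final-block MLP at every position.
    activation[:, :, harmful_neurons] = 0.0
    return activation

yes_id, no_id = model.to_single_token(" Yes"), model.to_single_token(" No")

def predicts_yes(prompt):
    logits = model.run_with_hooks(prompt, fwd_hooks=[("blocks.25.mlp.hook_post", ablate_hook)])
    return bool(logits[0, -1, yes_id] > logits[0, -1, no_id])

# Held-out (prompt, is_yes) pairs; a single hypothetical example shown here.
val_set = [("Is 9.11 bigger than 9.9? Answer Yes or No:", False)]
accuracy = sum(predicts_yes(p) == y for p, y in val_set) / len(val_set)
print("validation accuracy with ablation:", accuracy)
```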
We score neurons by their negative impact on accuracy (per dataset and globally). Note that here we use the full dataset for both training and prediction, with no splits; the next section uses a train/validation split, and the results remain similar.
- Appendix C. describes the mathematical intuition.
+ [Appendix C](#appendix-c) describes the mathematical intuition.
For each neuron j, score_j = h_j * (W_out[j] · g), where g is the gradient, with respect to `resid_post`, of the dot product between the layer-normed residual projection and the Yes/No direction; W_out[j] is the output-weight row of the j-th neuron, and h_j is the neuron's activation value.
We then align this with the truth_yn predictions (+1 if Yes, else -1).
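Roughly, this scoring rule might look like the sketch below; the hook names, prompt, and the use of the unembedding difference as the Yes/No direction are assumptions rather than the post's exact implementation.

```python
# Sketch of the gradient-based neuron score: for each final-block MLP neuron j,
# score_j = h_j * (W_out[j] · g), with g the gradient of the Yes/No readout with
# respect to resid_post. Hook names, prompt, and direction choice are assumptions.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b-it")
prompt, truth_is_yes = "Is 9.11 bigger than 9.9? Answer Yes or No:", False  # hypothetical example

yes_id, no_id = model.to_single_token(" Yes"), model.to_single_token(" No")
d_yes_no = model.W_U[:, yes_id] - model.W_U[:, no_id]        # Yes/No direction in unembed space

_, cache = model.run_with_cache(prompt)
resid_post = cache["blocks.25.hook_resid_post"][0, -1].detach().requires_grad_(True)
h = cache["blocks.25.mlp.hook_post"][0, -1]                   # neuron activations h_j

readout = model.ln_final(resid_post) @ d_yes_no               # projection onto the Yes/No direction
g = torch.autograd.grad(readout, resid_post)[0]               # gradient w.r.t. resid_post

W_out = model.W_out[25]                                       # [d_mlp, d_model]
scores = h * (W_out @ g)                                      # score_j = h_j * (W_out[j] · g)

sign = 1.0 if truth_is_yes else -1.0                          # truth_yn: +1 if Yes else -1
harmfulness = -sign * scores                                  # large = pushes toward the wrong answer
print("top-5 harmful neurons:", torch.topk(harmfulness, 5).indices.tolist())
```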
@@ -413,7 +415,7 @@ Then Integer_diff_len = 22
Then tied integer_equal_len and decimal_equal_len = 27
- ### B. What is the Yes–No mean difference?
+ ### B. What is the Yes–No mean difference? {#appendix-b}
Let each example i have:
- activation vector: h_i (dimension d)
@@ -439,7 +441,7 @@ Signed score of any activation h along this axis
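Reading the Appendix B definition together with this signed score, a small sketch (with hypothetical arrays, not the post's data) could look like this:

```python
# Sketch: the Yes-No mean-difference axis and the signed score of an activation
# along it. Arrays and names are illustrative placeholders.
import numpy as np

def yes_no_mean_difference(H, labels):
    """H: (n_examples, d) activations h_i; labels: 1 for Yes, 0 for No."""
    delta = H[labels == 1].mean(axis=0) - H[labels == 0].mean(axis=0)
    return delta / np.linalg.norm(delta)        # unit-norm axis

def signed_score(h, axis):
    """Signed projection of a single activation h onto the Yes-No axis."""
    return float(h @ axis)

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 64))                  # placeholder activations h_i (d = 64)
labels = rng.integers(0, 2, 100)                # placeholder Yes/No labels
axis = yes_no_mean_difference(H, labels)
print("signed score of the first example:", signed_score(H[0], axis))
```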