Commit 3c8ec55

Fix to |
1 parent 82a92e7 commit 3c8ec55

1 file changed

+3
-3
lines changed

_posts/2025-09-12-debugging-numeric-comparisons-llms.md

Lines changed: 3 additions & 3 deletions
@@ -280,7 +280,7 @@ Clearing in we can see, integers_diff_len has a much lower correlation with oth
280280

281281
### 4.1 unembedding analysis
282282

283-
**Goal**: Project model's activation onto it's final output head (model.lm_head.weight). Let **r** = (`W_u[Yes] − W_u[No]`) / ||(`W_u[Yes] − W_u[No]`)||. We compute two metrics, For each layer’s activations **h**, compute **logit gap** = ⟨h, r⟩ and **forced-choice accuracy** = (sign(gap)) * (+1 if Yes else -1), the classification accuracy obtained by taking the sign of logit gap as the prediction multiplied by 1 if Yes else -1.
283+
**Goal**: Project model's activation onto it's final output head (model.lm_head.weight). Let **r** = (`W_u[Yes] − W_u[No]`) / \|\|(`W_u[Yes] − W_u[No]`)\|\|. We compute two metrics, For each layer’s activations **h**, compute **logit gap** = ⟨h, r⟩ and **forced-choice accuracy** = (sign(gap)) * (+1 if Yes else -1), the classification accuracy obtained by taking the sign of logit gap as the prediction multiplied by 1 if Yes else -1.
284284

285285
a. Logit gap: the positive value for the gap indicates bias towards Yes and negative towards No. The magnitude tells how strongly it reflects.
286286
b. Forced choice accuracy: "If model were forced to decide Yes vs No using only the activations projected at this layer, how accurate would it be? "
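
The two metrics in this hunk can be illustrated with a minimal NumPy sketch. It is not the post's actual code: the function names, the toy arrays, and the token indices `yes_id` / `no_id` are hypothetical, `W_u` stands in for `model.lm_head.weight` as a `(vocab, d)` array, and the sketch assumes that forced-choice accuracy is averaged over examples and that a gap of exactly zero counts as incorrect.

```python
import numpy as np

def unit_readout_direction(W_u, yes_id, no_id):
    """r = (W_u[Yes] - W_u[No]) / ||W_u[Yes] - W_u[No]||."""
    diff = W_u[yes_id] - W_u[no_id]
    return diff / np.linalg.norm(diff)

def logit_gap(H, r):
    """Project each activation row h of H (n, d) onto r: gap = <h, r>."""
    return H @ r

def forced_choice_accuracy(gap, is_yes):
    """Fraction of examples where sign(gap) agrees with the label.

    Labels are +1 for Yes and -1 for No, so sign(gap) * label is
    positive exactly when the prediction matches (assumption: a gap
    of zero is treated as a miss).
    """
    labels = np.where(np.asarray(is_yes, dtype=bool), 1.0, -1.0)
    return float(np.mean(np.sign(gap) * labels > 0))

# Toy data (hypothetical): 3-token vocabulary, hidden size 2.
W_u = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]])
is_yes = np.array([True, False, False, False])

r = unit_readout_direction(W_u, yes_id=0, no_id=1)
gap = logit_gap(H, r)
acc = forced_choice_accuracy(gap, is_yes)
```

In practice `H` would hold one layer's activations at the answer position, and the same `r` would be reused across layers so the gaps are comparable.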
@@ -434,13 +434,13 @@ Yes–No mean difference vector
434434
- Delta = mu_Yes − mu_No (a vector in ℝ^d)
435435

436436
Unit direction (simple linear probe)
437-
- w = Delta / ||Delta||_2 (normalize Delta to length 1)
437+
- w = Delta / \|\|Delta\|\|_2 (normalize Delta to length 1)
438438

439439
Signed score of any activation h along this axis
440440
- score(h) = dot(h, w) (positive ⇒ evidence for “Yes”, negative ⇒ “No”)
441441

442442
Separation magnitude along the axis
443-
- separation = | E[score(h) | y=Yes] − E[score(h) | y=No] |
443+
- separation = \| E[score(h) \| y=Yes] − E[score(h) \| y=No] \|
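
The probe definitions above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the post's implementation: `mean_difference_probe`, `H`, and `is_yes` are hypothetical names, with `H` an `(n, d)` array of activations and `is_yes` a boolean label mask.

```python
import numpy as np

def mean_difference_probe(H, is_yes):
    """Return (w, scores, separation) for a Yes-No mean-difference probe."""
    is_yes = np.asarray(is_yes, dtype=bool)
    mu_yes = H[is_yes].mean(axis=0)        # mean activation over Yes examples
    mu_no = H[~is_yes].mean(axis=0)        # mean activation over No examples
    delta = mu_yes - mu_no                 # Delta, a vector in R^d
    w = delta / np.linalg.norm(delta)      # unit direction
    scores = H @ w                         # score(h) = dot(h, w)
    separation = abs(scores[is_yes].mean() - scores[~is_yes].mean())
    return w, scores, separation

# Toy data (hypothetical): two Yes and two No activations in R^2.
H = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
is_yes = np.array([True, True, False, False])
w, scores, separation = mean_difference_probe(H, is_yes)
```

Because the score is linear in h, the difference of the class means of the scores is dot(Delta, w), so when the separation is measured on the same data used to fit w it equals ||Delta||_2. It only carries independent information when computed on held-out activations.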
444444

445445
### C. Harmful Neuron Finding {#appendix-c}
446446
