author: divyansh singhvi and LLMs
Study on Gemma-2-2B-IT
> **TL;DR:**
> LLMs *internally* represent the correct way to compare numbers (80–90% accuracy in penultimate layers), but the **final layer corrupts this knowledge**, causing simple failures like `9.11 > 9.8`.
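The per-layer accuracies above come from reading the Yes/No decision out of intermediate residual streams. A minimal logit-lens-style sketch of that readout, with toy random tensors standing in for Gemma-2-2B-IT's actual activations and unembedding (the dimensions, token ids, and `layer_decision` helper are illustrative assumptions, not the study's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab = 16, 4, 8
YES, NO = 3, 5  # hypothetical token ids standing in for "Yes"/"No"

# Toy stand-ins for per-layer residual streams and the unembedding matrix;
# in the real study these would be cached activations from Gemma-2-2B-IT.
resids = rng.standard_normal((n_layers, d_model))
W_U = rng.standard_normal((d_model, vocab))

def layer_decision(resid):
    """Logit-lens readout: project a residual stream straight through
    the unembedding and compare the Yes/No logits."""
    logits = resid @ W_U
    return "Yes" if logits[YES] > logits[NO] else "No"

# One Yes/No decision per layer; comparing these against ground truth
# over many prompts yields a per-layer accuracy curve.
per_layer = [layer_decision(r) for r in resids]
print(per_layer)
```

Scoring these per-layer decisions against ground truth across a prompt set is what produces the kind of accuracy-by-layer curve the TL;DR refers to.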
---
# Executive Summary
## The Problem: A paradox in LLM Capabilities
LLMs can ace complex reasoning yet still fail at simple numeric comparisons like `9.11 > 9.8`. Using Gemma-2-2B-IT, I ask: **Does the model internally represent the correct Yes/No decision, and if so, where does the failure happen?** This matters because robust numeric comparison is a prerequisite for any downstream task that relies on arithmetic or ordering.
## High-level takeaways
Gemma-2-2B-IT internally represents the correct comparator, but the last-layer MLP corrupts it.
---
## 1. Motivation
Large Language Models (LLMs) have demonstrated Olympiad-level performance in complex reasoning, yet they paradoxically stumble on fundamental operations like the basic numeric comparison `9.11 > 9.8`. If a model cannot reliably perform basic numeric comparisons, its utility in downstream tasks that depend on such reasoning is severely compromised. This study investigates why a model like Gemma-2-2B-IT fails at these seemingly simple evaluations.
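The characteristic mistake is consistent with comparing the fractional parts as integers (11 > 8). A toy sketch that reproduces this failure mode next to the correct comparison (the `llm_style_compare` helper is purely illustrative of the error pattern, not a claim about the model's actual mechanism):

```python
def llm_style_compare(a: str, b: str) -> str:
    """Hypothetical buggy comparator: compare integer parts, then
    fractional parts as whole integers, reproducing "9.11 > 9.8"."""
    ai, af = a.split(".")
    bi, bf = b.split(".")
    if int(ai) != int(bi):
        return a if int(ai) > int(bi) else b
    # Bug: treats the fractional digits as integers, so 11 "beats" 8.
    return a if int(af) > int(bf) else b

print(llm_style_compare("9.8", "9.11"))  # "9.11" -- the buggy answer
print(max(9.8, 9.11))                    # 9.8  -- the correct answer
```

The gap between these two outputs is exactly the behaviour the rest of this study localizes inside the model.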
---
Results are not very different from the previous approach, showing similar trends.
- (a) **Prompt diversity**: Only a single prompt was analysed; different types of prompts could be tried.
- (b) **String Analysis**: More exploration is needed into why strings were biased towards *No*.
- (c) Tokenization was not deeply analysed.
- (d) Didn't try **PCA post-ablation** to see how the geometry changes.
- (e) The work currently analyses only one model, **Gemma-2-2b-it**, and only an instruction-tuned one. A good study would compare how the last layer of the non-instruction-tuned model behaves vs the instruction-tuned one.
- (f) Negative numbers in comparisons were not analysed.