Commit 826ceab

fixed counting equation. added spaces.

1 parent 17cceb5 commit 826ceab

2 files changed (+8, -7 lines)

_posts/2024-05-30-counting.md

Lines changed: 8 additions & 7 deletions
````diff
@@ -20,7 +20,7 @@ In this blog post, we summarize a recent paper which is part of an ongoing effor
 
 
 ## Introducing the Contextual Counting Task
-
+<br>
 In this task, the input is a sequence composed of zeros, ones, and square bracket delimiters: `{0, 1, [, ]}`. Each sample sequence contains ones and zeros with several regions marked by the delimiters. The task is to count the number of ones within each delimited region. For example, given the sequence:
 
 ```
````
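
As a concrete illustration of the task defined in the context lines of this hunk, here is a minimal data-generation sketch. The function names (`make_sample`, `region_counts`) and the length/region parameters are invented for this example and are not taken from the paper's code.

```python
import random

def make_sample(seq_len=30, num_regions=2):
    """Build a sequence over {0, 1, [, ]} with `num_regions` delimited regions."""
    tokens = [str(random.randint(0, 1)) for _ in range(seq_len)]
    # Pick 2 * num_regions distinct insertion points and wrap them in brackets,
    # inserting from the right so earlier positions stay valid.
    cuts = sorted(random.sample(range(seq_len + 1), 2 * num_regions))
    for i, pos in enumerate(reversed(cuts)):
        tokens.insert(pos, "]" if (len(cuts) - 1 - i) % 2 else "[")
    return tokens

def region_counts(tokens):
    """Ground truth: the number of 1-tokens inside each [ ... ] region."""
    counts, inside, current = [], False, 0
    for t in tokens:
        if t == "[":
            inside, current = True, 0
        elif t == "]":
            inside = False
            counts.append(current)
        elif inside and t == "1":
            current += 1
    return counts

seq = make_sample()
print("".join(seq), region_counts(seq))
```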
````diff
@@ -57,7 +57,7 @@ Moreover, toy problems are instrumental in benchmarking and testing new theories
 
 
 ## Theoretical Insights
-
+<br>
 We provide some theoretical insights into the problem, showing that a Transformer with one causal encoding layer and one decoding layer can solve the contextual counting task for arbitrary sequence lengths and numbers of regions.
 
 #### Contextual Position (CP)
````
````diff
@@ -90,7 +90,7 @@ These propositions highlight the difficulties non-causal Transformers face in so
 
 
 ## Experimental Results
-
+<br>
 The theoretical results above imply that exact solutions exist but do not clarify whether or not such solutions can indeed be found when the model is trained via SGD. We therefore trained various Transformer architectures on this task. Inspired by the theoretical arguments, we use an encoder-decoder architecture, with one layer and one head for each. A typical output of the network is shown in the following image where the model outputs the probability distribution over the number of ones in each region.
 
 <p align="center">
````
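
The "one layer and one head for each" encoder-decoder setup mentioned in this hunk can be sketched with stock PyTorch modules. The hidden size, the decoder-query scheme, and the maximum count below are assumptions for illustration, not the paper's configuration; note that no positional encoding is added, matching the NoPE setup referenced in the conclusion.

```python
import torch
import torch.nn as nn

VOCAB = {"0": 0, "1": 1, "[": 2, "]": 3}
MAX_COUNT = 32  # assumed upper bound on the number of ones per region

class ContextualCounter(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)  # no positional encoding (NoPE)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=1,
            num_encoder_layers=1, num_decoder_layers=1,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.head = nn.Linear(d_model, MAX_COUNT + 1)  # distribution over counts

    def forward(self, src_ids, query_ids):
        # src_ids: (batch, seq) token ids; query_ids: (batch, n_regions) decoder
        # queries, e.g. one per region -- the exact query scheme is assumed here.
        src = self.embed(src_ids)
        tgt = self.embed(query_ids)
        causal = self.transformer.generate_square_subsequent_mask(src_ids.size(1))
        out = self.transformer(src, tgt, src_mask=causal)  # causal encoder self-attention
        return self.head(out)  # (batch, n_regions, MAX_COUNT + 1) logits
```

A training loop would then compare these logits to the ground-truth per-region counts, e.g. with a cross-entropy loss.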
````diff
@@ -176,19 +176,20 @@ If you made it this far, here is an interesting bonus point:
 
 * Even though the model has access to the number n through its attention profile, it still does not construct a probability distribution that is sharply peaked at n. As we see in the above figure, as n gets large, this probability distribution gets wider. This, we believe is partly the side-effect of this specific solution where two curves are being balanced against each other. But it is partly a general problem that as the number of tokens that are attended to gets large, we need higher accuracy to be able to infer n exactly. This is because the information about n is coded non-linearly after the attention layer. In this case, if we assume that the model attends to BoS and 1-tokens equally the output becomes:
 
-$\frac1{n+1} (n \times v_1 + 1 \times v_\text{BoS})$
+<p align="center">
+<img src="/images/blog/counting/n_dependence.png" alt="The n-dependence of the model output." width="55%" style="mix-blend-mode: darken;">
+</p>
 
-We see that as n becomes large, the difference between $n$ and $n+1$ becomes smaller.
+We see that as n becomes large, the difference between n and n+1 becomes smaller.
 
 
 ## Conclusion
-
+<br>
 The contextual counting task provides a valuable framework for exploring the interpretability of Transformers in scientific and quantitative contexts. Our experiments show that causal Transformers with NoPE can effectively solve this task, while non-causal models struggle. These findings highlight the importance of task-specific interpretability challenges and the potential for developing more robust and generalizable models for scientific applications.
 
 For more details, check out our preprint on the [arXiv](link).
 
 *-- Siavash Golkar*
 
----
 
 Image by [Tim Mossholder](https://unsplash.com/photos/blue-and-black-electric-wires-FwzhysPCQZc) via Unsplash.
````
n_dependence.png: 26.6 KB (image file; preview not shown)
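
A small addendum to the bonus point in the final hunk above (a back-of-the-envelope calculation, not from the paper): writing the removed display as $o(n) = \frac{n\,v_1 + v_\text{BoS}}{n+1}$, the gap between consecutive counts is

$$
o(n+1) - o(n) = \frac{(n+1)\,v_1 + v_\text{BoS}}{n+2} - \frac{n\,v_1 + v_\text{BoS}}{n+1} = \frac{v_1 - v_\text{BoS}}{(n+1)(n+2)},
$$

which shrinks roughly as $1/n^2$. This is why inferring n exactly requires ever higher accuracy as n grows, consistent with the widening output distributions described in the bullet point.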
