
Commit 7cb6e3c

Update transformer.ipynb
1 parent e7785a6 commit 7cb6e3c

File tree

1 file changed: +20 −20 lines changed


workshops/transformer/transformer.ipynb

Lines changed: 20 additions & 20 deletions
@@ -59,7 +59,7 @@
 "> ##### Learning objectives\n",
 ">\n",
 "> - Understand what a transformer is used for\n",
-"> - Understand causal attention, and what a transformer's output representsalgebra operations on tensors\n",
+"> - Understand causal attention, and what a transformer's output represents (algebraic operations on tensors)\n",
 "> - Learn what tokenization is, and how models do it\n",
 "> - Understand what logits are, and how to use them to derive a probability distribution over the vocabulary\n",
 "\n",
@@ -195,7 +195,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Our tranformer's input is natural language (i.e. a sequence of characters, strings, etc). But ML models generally take vectors as input, not langage. How do we convert language to vectors?\n",
+"Our transformer's input is natural language (i.e., a string of characters). But ML models generally take vectors as input, not language. How do we convert language to vectors?\n",
 "\n",
 "We can factor this into 2 questions:\n",
 "\n",
@@ -211,7 +211,7 @@
 "source": [
 "### Converting sub-units to vectors\n",
 "\n",
-"We basically make a massive lookup table, which is called an **embedding**. It has one vector for each possible sub-unit of language we might get (we call this set of all sub-units our **vocabulary**). We label every element in our vocabulary with an integer (this labelling never changes), and we use this integer to index into the embedding.\n",
+"Text is split into small pieces called tokens (e.g., words or word parts) using a tokenizer. Each token gets a number, which we turn into a vector (a list of numbers) using a table called an **embedding**.\n",
 "\n",
 "A key intuition is that one-hot encodings let you think about each integer independently. We don't bake in any relation between words when we perform our embedding, because every word has a completely separate embedding vector.\n",
 "\n",
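The token-id-to-vector lookup described in the new text above can be sketched in a few lines of NumPy (toy vocabulary and random values, not the notebook's actual embedding):

```python
import numpy as np

# Hypothetical 3-word vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2}   # token -> integer id (this labelling never changes)
d_model = 4
rng = np.random.default_rng(0)
W_E = rng.normal(size=(len(vocab), d_model))  # embedding table: one row per vocabulary entry

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
embeddings = W_E[token_ids]  # plain integer indexing IS the lookup
```

Note that each row of `W_E` is learned independently: no relation between words is baked in.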
@@ -349,7 +349,7 @@
 "\n",
 "#### **Step 1:** Convert text to tokens\n",
 "\n",
-"The sequence gets tokenized, so it has shape `[batch, seq_len]`. Here, the batch dimension is just one (because we only have one sequence).\n"
+"The sequence gets tokenized, so it has shape `[batch, seq_len]`. Here, the batch dimension is just one because we only have one sequence. Think of `batch` as how many sentences we process, and `seq_len` as sentence length in tokens.\n"
 ]
 },
 {
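The `[batch, seq_len]` shape from the hunk above, as a minimal sketch (the token ids here are made up for illustration):

```python
import numpy as np

# One 3-token sentence -> a [batch, seq_len] array with batch = 1.
token_ids = np.array([[15496, 11, 995]])  # hypothetical ids, values don't matter here
batch, seq_len = token_ids.shape
```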
@@ -592,13 +592,13 @@
 "source": [
 "#### Attention\n",
 "\n",
-"First we have attention. This moves information from prior positions in the sequence to the current token.\n",
+"Attention heads decide which earlier positions in the sequence matter most for the current token, and copy information from those positions.\n",
 "\n",
 "We do this for *every* token in parallel using the same parameters. The only difference is that we look backwards only (to avoid \"cheating\"). This means later tokens have more of the sequence that they can look at.\n",
 "\n",
 "Attention layers are the only bit of a transformer that moves information between positions (i.e. between vectors at different sequence positions in the residual stream).\n",
 "\n",
-"Attention layers are made up of `n_heads` heads - each with their own parameters, own attention pattern, and own information how to copy things from source to destination. The heads act independently and additively, we just add their outputs together, and back to the stream.\n",
+"Attention layers are made up of `n_heads` attention heads - each with its own parameters, its own attention pattern, and its own rule for what information to copy from source to destination. The heads act independently and additively: we just sum their outputs together back into the residual stream.\n",
 "\n",
 "Each head does the following:\n",
 "* Produces an **attention pattern** for each destination token, a probability distribution of prior source tokens (including the current one) weighting how much information to copy.\n",
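The causal ("look backwards only") attention pattern described above can be sketched for a single head in NumPy (random scores as stand-ins for the real query-key dot products):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # destination x source scores (toy values)

# Causal mask: block attention to future positions (strictly above the diagonal).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax over source positions -> a probability distribution per destination token.
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)
```

Each row of `pattern` sums to 1, and the first token can only attend to itself, matching the "later tokens have more of the sequence to look at" point.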
@@ -629,7 +629,7 @@
 "source": [
 "### MLP\n",
 "\n",
-"The MLP layers are just a standard neural network, with a singular hidden layer and a nonlinear activation function. The exact activation isn't conceptually important ([GELU](https://paperswithcode.com/method/gelu) seems to perform best).\n",
+"The MLP layers are just a standard neural network, with a single hidden layer and a nonlinear activation function. The exact activation function used isn't conceptually important, though [GeLU](https://paperswithcode.com/method/gelu) is often chosen for its performance.\n",
 "\n",
 "Our hidden dimension is normally `d_mlp = 4 * d_model`. Exactly why the ratios are what they are isn't super important (people basically cargo-cult what GPT did back in the day!).\n",
 "\n",
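A minimal sketch of the MLP block described above, with the usual `d_mlp = 4 * d_model` ratio (NumPy with random weights, using the common tanh approximation of GELU; not the notebook's actual module):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model = 8
d_mlp = 4 * d_model  # the conventional ratio
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_model, d_mlp))
W_out = rng.normal(size=(d_mlp, d_model))

x = rng.normal(size=(3, d_model))      # 3 residual-stream vectors
mlp_out = gelu(x @ W_in) @ W_out       # project up, nonlinearity, project back down
```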
@@ -667,9 +667,10 @@
 "\n",
 "#### LayerNorm\n",
 "\n",
-"* Simple normalization function applied at the start of each layer (i.e. before each MLP, attention layer, and before the unembedding)\n",
-"* Converts each input vector (independently in parallel for each batch x position residual stream vector) to have mean zero and variance 1.\n",
-"* Then applies an elementwise scaling and translation\n",
+"* Normalization function that keeps activations at a consistent scale, which helps the model train more stably.\n",
+"* Applied at the start of each layer (i.e. before each MLP, attention layer, and before the unembedding)\n",
+"* Converts each input vector (independently in parallel for each batch x position residual stream vector) to have mean `0.0` and variance `1.0`\n",
+"* Then applies an element-wise scaling and translation\n",
 "* Cool maths tangent: The scale & translate is just a linear map. LayerNorm is only applied immediately before another linear map. Linear compose linear = linear, so we can just fold this into a single effective linear layer and ignore it.\n",
 " * `fold_ln=True` flag in `from_pretrained` does this for you.\n",
 "* LayerNorm is annoying for interpretability - the scale part is not linear, so you can't think about different bits of the input independently. But it's *almost* linear - if you're changing a small part of the input it's linear, but if you're changing enough to alter the norm substantially it's not linear.\n",
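The normalize-then-scale-and-translate steps listed above, as a small NumPy sketch (toy sizes; real implementations like `torch.nn.LayerNorm` do the same thing per vector):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector (last axis) to mean 0 and variance 1...
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normalized = (x - mean) / np.sqrt(var + eps)
    # ...then apply the learned element-wise scale and translation.
    return normalized * gamma + beta

d_model = 6
x = np.arange(12, dtype=float).reshape(2, d_model)  # 2 residual stream vectors
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
```

With `gamma = 1` and `beta = 0` the output rows have mean 0 and variance (almost exactly) 1.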
@@ -679,13 +680,12 @@
 "#### Positional embeddings\n",
 "\n",
 "* **Problem:** Attention operates over all pairs of positions. This means it's symmetric with regards to position - the attention calculation from token 5 to token 1 and token 5 to token 2 are the same by default\n",
-" * This is dumb because nearby tokens are more relevant.\n",
-"* There's a lot of dumb hacks for this.\n",
-"* We'll focus on **learned, absolute positional embeddings**. This means we learn a lookup table mapping the index of the position of each token to a residual stream vector, and add this to the embed.\n",
+" * This is inefficient because nearby tokens are often more relevant.\n",
+"* One solution is **learned, absolute positional embeddings**. This involves a learned lookup table mapping each token's position to a residual stream vector, and adding this to the embed.\n",
 " * Note that we *add* rather than concatenate. This is because the residual stream is shared memory, and likely under significant superposition (the model compresses more features in there than the model has dimensions)\n",
 " * We basically never concatenate inside a transformer, unless doing weird shit like generating text efficiently.\n",
 "* This connects to **attention as generalized convolution**\n",
-" * We argued that language does still have locality, and so it's helpful for transformers to have access to the positional information so they \"know\" two tokens are next to each other (and hence probably relevant to each other).\n",
+" * Since language does have locality it is helpful for transformers to have access to positional information so they \"know\" whether two tokens are next to each other, and hence likely relevant to each other.\n",
 "</details>"
 ]
 },
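The add-don't-concatenate point from the hunk above, sketched in NumPy (random stand-ins for the learned tables):

```python
import numpy as np

seq_len, d_model = 5, 8
rng = np.random.default_rng(0)
token_embed = rng.normal(size=(seq_len, d_model))  # stand-in for the token embeddings
W_pos = rng.normal(size=(seq_len, d_model))        # learned lookup: position index -> vector

# Look up each position's vector and ADD it to the token embedding;
# the residual stream dimension stays d_model (no concatenation).
resid = token_embed + W_pos[np.arange(seq_len)]
```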
@@ -927,7 +927,7 @@
 " nn.init.normal_(self.W_pos, std=self.cfg.init_range)\n",
 "\n",
 " def forward(self, tokens: Int[Tensor, \"batch token\"]) -> Float[Tensor, \"batch token d_model\"]:\n",
-" # Hint: You should use the einops.repeat or torch.reapeat function\n",
+" # Hint: You should use the einops.repeat or torch.repeat function\n",
 " # to repeat batch-wise the positional embedding.\n",
 " # Hide: hard\n",
 " # The value of tokens is not important here, only the size of the tensor!\n",
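The batch-wise repeat that the hint above asks for can be sketched with NumPy broadcasting (toy sizes; in the notebook itself `einops.repeat(pos_embed, "seq d_model -> batch seq d_model", batch=batch)` would play the same role):

```python
import numpy as np

batch, seq_len, d_model = 2, 4, 8
rng = np.random.default_rng(0)
W_pos = rng.normal(size=(seq_len, d_model))  # positional embedding table (random stand-in)

# Repeat the same positional embedding for every sequence in the batch.
pos_embed = np.broadcast_to(W_pos, (batch, seq_len, d_model))
```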
@@ -1125,7 +1125,7 @@
 " einops.einsum(\n",
 " normalized_resid_pre,\n",
 " self.W_Q,\n",
-" \"batch posn d_model, nheads d_model d_head -> ???\",\n",
+" \"batch token d_model, head d_model d_head -> ???\",\n",
 " )\n",
 " + self.b_Q\n",
")\n",
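The shape bookkeeping of this query projection can be checked with `np.einsum` (hypothetical small sizes; the `???` output in the notebook is left as an exercise, here we just demonstrate the dimension pattern):

```python
import numpy as np

batch, token, head, d_model, d_head = 2, 5, 4, 16, 8
rng = np.random.default_rng(0)
resid = rng.normal(size=(batch, token, d_model))   # stand-in for normalized_resid_pre
W_Q = rng.normal(size=(head, d_model, d_head))     # stand-in for self.W_Q

# Contract over d_model; the output keeps batch, token, head, d_head.
q = np.einsum("btm,hmd->bthd", resid, W_Q)
```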
@@ -1138,7 +1138,7 @@
 "attn_scores = einops.einsum(\n",
 " q,\n",
 " k,\n",
-" \"???,??? -> batch nheads posn_Q posn_K\",\n",
+" \"???,??? -> batch head token_Q token_K\",\n",
")\n",
 "\n",
 "# then scale and apply mask and apply softmax on the correct dimension to get probabilities\n",
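One way the `???` inputs above can work out shape-wise, sketched with `np.einsum` on hypothetical sizes (contracting over `d_head`, keeping one token axis per side of the score matrix):

```python
import numpy as np

batch, token, head, d_head = 2, 5, 4, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(batch, token, head, d_head))
k = rng.normal(size=(batch, token, head, d_head))

# Scores: one (token_Q x token_K) matrix per batch element and head,
# scaled by sqrt(d_head) as the comment in the notebook describes.
attn_scores = np.einsum("bqhd,bkhd->bhqk", q, k) / np.sqrt(d_head)
```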
@@ -1149,15 +1149,15 @@
 "z = einops.einsum(\n",
 " v,\n",
 " attn_pattern,\n",
-" \"???,??? -> batch posn_Q nheads d_head\",\n",
+" \"???,??? -> batch token_Q head d_head\",\n",
")\n",
 "\n",
 "# Calculate output (by applying matrix W_O and summing over heads, then adding bias b_O)\n",
 "attn_out = (\n",
 " einops.einsum(\n",
 " z,\n",
 " self.W_O,\n",
-" \"batch posn_Q nheads d_head, nheads d_head d_model -> batch posn_Q d_model\",\n",
+" \"batch token_Q head d_head, head d_head d_model -> batch token_Q d_model\",\n",
 " )\n",
 " + self.b_O\n",
")\n",
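The two remaining einsums above (weighted sum over source tokens, then projecting and summing over heads) can be shape-checked in NumPy; the uniform attention pattern here is a stand-in so the example stays self-contained:

```python
import numpy as np

batch, token, head, d_head, d_model = 2, 5, 4, 8, 16
rng = np.random.default_rng(0)
v = rng.normal(size=(batch, token, head, d_head))
W_O = rng.normal(size=(head, d_head, d_model))
# Hypothetical uniform attention pattern: each row sums to 1.
attn_pattern = np.full((batch, head, token, token), 1.0 / token)

# z: for each destination token, the pattern-weighted sum of source values.
z = np.einsum("bkhd,bhqk->bqhd", v, attn_pattern)
# attn_out: apply W_O and sum over heads, back to the residual stream width.
attn_out = np.einsum("bqhd,hdm->bqm", z, W_O)
```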
@@ -1646,7 +1646,7 @@
 "source": [
 "If you've finished this, congrats! \n",
 "You should ask the TA what to do next. One option is \n",
-"to look at training and sampling from tranformers in the rest of the arena notebook, which you can find it at https://colab.research.google.com/github/EffiSciencesResearch/ML4G-2.0/blob/master/workshops/transformer/transformer-arena.ipynb."
+"to look at training and sampling from transformers in the rest of the arena notebook, which you can find at https://colab.research.google.com/github/EffiSciencesResearch/ML4G-2.0/blob/master/workshops/transformer/transformer-arena.ipynb."
 ]
 }
 ],

0 commit comments
