Commit 510e713

Update 06.ipynb
1 parent d9d92d4 commit 510e713

File tree

1 file changed: 19 additions, 1 deletion

06.ipynb

Lines changed: 19 additions & 1 deletion
@@ -518,15 +518,33 @@
 "In pseudocode:\n",
 "\n",
 "```python\n",
+"# Equation 1\n",
 "x_input = [class_token, image_patch_1, image_patch_2, ..., image_patch_N] + [class_token_pos, image_patch_1_pos, image_patch_2_pos, ..., image_patch_N_pos]\n",
 "```\n",
 "---\n",
 "\n",
 "##### Equation 2&3\n",
 "The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski \\& Auli, 2019).\n",
 "\n",
+"In pseudocode:\n",
+"\n",
+"```python\n",
+"# Equation 2\n",
+"x_output_MSA_block = MSA_layer(LN_layer(x_input)) + x_input\n",
+"\n",
+"# Equation 3\n",
+"x_output_MLP_block = MLP_layer(LN_layer(x_output_MSA_block)) + x_output_MSA_block\n",
+"```\n",
+"\n",
 "##### Equation 4\n",
-"Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches $\\left(\\mathbf{z}_0^0=\\mathbf{x}_{\\text {class }}\\right)$, whose state at the output of the Transformer encoder $\\left(\\mathbf{z}_L^0\\right)$ serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time."
+"Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches $\\left(\\mathbf{z}_0^0=\\mathbf{x}_{\\text {class }}\\right)$, whose state at the output of the Transformer encoder $\\left(\\mathbf{z}_L^0\\right)$ serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.\n",
+"\n",
+"In pseudocode:\n",
+"\n",
+"```python\n",
+"# Equation 4\n",
+"y = Linear_layer(LN_layer(x_output_MLP_block))\n",
+"```"
 ]
 },
 {
