|
512 | 512 | "##### Equation 1\n",
|
513 | 513 | "An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle $2 \\mathrm{D}$ images, we reshape the image $\\mathrm{x} \\in \\mathbb{R}^{H \\times W \\times C}$ into a sequence of flattened $2 \\mathrm{D}$ patches $\\mathbf{x}_p \\in \\mathbb{R}^{N \\times\\left(P^2 \\cdot C\\right)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N=H W / P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.\n",
|
514 | 514 | "\n",
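To make the reshaping step concrete, here is a minimal NumPy sketch of the patchify-and-project operation. The sizes (a 224×224×3 image, 16×16 patches, D = 768) and the random matrix `E` standing in for the trainable projection are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

# Illustrative sizes: a 224x224 RGB image, 16x16 patches, latent width D = 768.
H, W, C, P, D = 224, 224, 3, 16, 768
N = H * W // P**2                                 # number of patches (196)

x = np.random.rand(H, W, C)                       # stand-in input image

# Split the image into non-overlapping P x P patches and flatten each one.
patches = x.reshape(H // P, P, W // P, P, C)      # (H/P, P, W/P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
patches = patches.reshape(N, P * P * C)           # (N, P^2 * C)

# Trainable linear projection (random here) mapping each flattened patch to D dimensions.
E = np.random.rand(P * P * C, D)
patch_embeddings = patches @ E                    # (N, D)
```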
|
| 515 | + "##### Equation 1\n", |
| 516 | + "Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.\n", |
| 517 | + "\n", |
| 518 | + "In pseudocode:\n", |
| 519 | + "\n", |
| 520 | + "```python\n", |
| 521 | + "# Equation 1\n", |
| 522 | + "x_input = [class_token, image_patch_1, image_patch_2, ..., image_patch_N] + [class_token_pos, image_patch_1_pos, image_patch_2_pos, ..., image_patch_N_pos]\n", |
| 523 | + "```\n", |
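Read concretely, the list notation above means: prepend the [class] token to the projected patch embeddings, then add the position embeddings element-wise. A self-contained NumPy sketch, where the random `class_token` and `pos_embedding` arrays stand in for trainable parameters:

```python
import numpy as np

N, D = 196, 768                                   # illustrative: 14x14 patches, width D = 768
patch_embeddings = np.random.rand(N, D)           # projected patches from the first half of Eq. 1
class_token = np.random.rand(1, D)                # stand-in for the learnable [class] embedding
pos_embedding = np.random.rand(N + 1, D)          # stand-in for learnable 1D position embeddings

z0 = np.concatenate([class_token, patch_embeddings], axis=0)  # prepend [class] token: (N + 1, D)
z0 = z0 + pos_embedding                           # add position embeddings element-wise
```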
| 524 | + "---\n", |
| 525 | + "\n", |
| 526 | + "##### Equation 2&3\n", |
| 527 | + "The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded selfattention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski \\& Auli, 2019).\n", |
| 528 | + "\n", |
| 529 | + "In pseudocode:\n", |
| 530 | + "\n", |
| 531 | + "```python\n", |
| 532 | + "# Equation 2\n", |
| 533 | + "x_output_MSA_block = MSA_layer(LN_Layer(x_input)) + x_input\n", |
| 534 | + "\n", |
| 535 | + "# Equation 3\n", |
| 536 | + "x_output_MLP_block = MLP_layer(LN_layer(x_output_MSA_block)) + x_output_MSA_block\n", |
| 537 | + "```\n", |
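For an executable version of the same two equations, here is a minimal PyTorch sketch of one pre-norm encoder block. It is an illustration under assumed sizes (D = 768, 12 heads, MLP width 3072); the class name `EncoderBlock` and the use of `nn.MultiheadAttention` are choices made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: Eq. 2 (MSA) followed by Eq. 3 (MLP)."""
    def __init__(self, D=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, D))

    def forward(self, z):
        # Eq. 2: z' = MSA(LN(z)) + z
        z_norm = self.ln1(z)
        attn_out, _ = self.msa(z_norm, z_norm, z_norm)
        z = attn_out + z
        # Eq. 3: z_out = MLP(LN(z')) + z'
        return self.mlp(self.ln2(z)) + z

block = EncoderBlock()
z = torch.randn(1, 197, 768)          # batch of 1: 196 patch embeddings + [class] token
print(block(z).shape)                 # torch.Size([1, 197, 768])
```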
| 538 | + "\n", |
515 | 539 | "##### Equation 4\n",
|
516 | 540 | "Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches $\\left(\\mathbf{z}_0^0=\\mathbf{x}_{\\text {class }}\\right)$, whose state at the output of the Transformer encoder $\\left(\\mathbf{z}_L^0\\right)$ serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.\n",
|
517 | 541 | "\n",
|
518 |
| - "##### Equation 1\n", |
519 |
| - "Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.\n", |
| 542 | + "In pseudocode:\n", |
520 | 543 | "\n",
|
521 |
| - "##### Equation 2\n", |
522 |
| - "The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded selfattention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski \\& Auli, 2019)." |
| 544 | + "```python\n", |
| 545 | + "# Equation 4\n", |
| 546 | + "y = Linear_layer(LN_layer(x_output_MLP_block))\n", |
| 547 | + "```" |
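To keep Eq. 4 separate from the classification head described in the paragraph above, here is a small PyTorch sketch. The head names and the tanh hidden activation are assumptions for illustration; only the structure (an MLP head with one hidden layer for pre-training, a single linear layer for fine-tuning, both attached on top of y) comes from the text.

```python
import torch
import torch.nn as nn

D, num_classes = 768, 1000                        # illustrative sizes
y = torch.randn(1, D)                             # image representation from Eq. 4 (random stand-in)

# Pre-training head: an MLP with one hidden layer (tanh is an assumed nonlinearity).
pretrain_head = nn.Sequential(nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes))

# Fine-tuning head: a single linear layer.
finetune_head = nn.Linear(D, num_classes)

logits = finetune_head(y)                         # (1, num_classes)
```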
523 | 548 | ]
|
524 | 549 | },
|
525 | 550 | {
|
|