|
486 | 486 | "source": [
|
487 | 487 | "### Image feature extractor\n",
|
488 | 488 | "\n",
|
489 | | - "You will use an image model (pretrained on ImageNet) to extract the features from each image. The model was trained as an image classifier, but setting `include_top=False` returns the model without the final classification layer, so you can use the last layer of feature-maps:\n",

490 | | - "\n",

491 | | - "\n"

| 489 | + "You will use an image model (pretrained on ImageNet) to extract the features from each image. The model was trained as an image classifier, but setting `include_top=False` returns the model without the final classification layer, so you can use the last layer of feature-maps:\n"
492 | 490 | ]
|
493 | 491 | },
|
494 | 492 | {
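For context, a minimal sketch of a feature extractor built this way; the backbone choice, input size, and `include_preprocessing` flag are assumptions, and the notebook may use a different `tf.keras.applications` model:

```python
import tensorflow as tf

IMAGE_SHAPE = (224, 224, 3)  # assumed input size; match your image preprocessing

# include_top=False drops the final classification head, so the model
# returns the last layer of feature maps instead of class logits.
feature_extractor = tf.keras.applications.MobileNetV3Small(
    input_shape=IMAGE_SHAPE,
    include_top=False,
    include_preprocessing=True)
feature_extractor.trainable = False  # freeze it: use as a fixed feature extractor

dummy_batch = tf.zeros([1, *IMAGE_SHAPE])
print(feature_extractor(dummy_batch).shape)  # (1, 7, 7, 576) for this backbone
```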
|
|
1053 | 1051 | "id": "qiRXWwIKNybB"
|
1054 | 1052 | },
|
1055 | 1053 | "source": [
|
1056 | | - "\n",

1057 | | - "\n",
1058 | 1054 | "The model will be implemented in three main parts: \n",
|
1059 | 1055 | "\n",
|
1060 | 1056 | "1. Input - The token embedding and positional encoding (`SeqEmbedding`).\n",
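A minimal sketch of what a `SeqEmbedding` layer like this can look like, assuming learned position embeddings; the exact constructor arguments are not shown in this hunk:

```python
import tensorflow as tf

class SeqEmbedding(tf.keras.layers.Layer):
  """Token embedding plus a learned position embedding."""
  def __init__(self, vocab_size, max_length, depth):
    super().__init__()
    self.token_embedding = tf.keras.layers.Embedding(
        input_dim=vocab_size, output_dim=depth,
        mask_zero=True)  # treat id 0 as padding and propagate the mask
    self.pos_embedding = tf.keras.layers.Embedding(
        input_dim=max_length, output_dim=depth)
    self.add = tf.keras.layers.Add()

  def call(self, seq):
    seq = self.token_embedding(seq)            # (batch, t, depth)
    pos = tf.range(tf.shape(seq)[1])           # (t,)
    pos = self.pos_embedding(pos[tf.newaxis])  # (1, t, depth), broadcasts over batch
    return self.add([seq, pos])
```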
|
|
1164 | 1160 | "    attn = self.mha(query=x, value=x,\n",

1165 | 1161 | "                    use_causal_mask=True)\n",

1166 | 1162 | "    x = self.add([x, attn])\n",

1167 | | - "    return self.layernorm(x)\n",

1168 | | - "\n"

| 1163 | + "    return self.layernorm(x)\n"
1169 | 1164 | ]
|
1170 | 1165 | },
|
1171 | 1166 | {
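For context, the `call` body shown in this hunk sits inside a layer along these lines; the constructor below is a reconstruction, so treat it as an assumption:

```python
import tensorflow as tf

class CausalSelfAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.add = tf.keras.layers.Add()  # Add (rather than +) propagates keras masks
    self.layernorm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    # use_causal_mask=True stops each position from attending to later positions.
    attn = self.mha(query=x, value=x,
                    use_causal_mask=True)
    x = self.add([x, attn])  # residual connection
    return self.layernorm(x)
```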
|
|
1305 | 1300 | "id": "6WQD87efena5"
|
1306 | 1301 | },
|
1307 | 1302 | "source": [
|
1308 | | - "\n",

1309 | | - "\n",
1310 | 1303 | "But there are a few other features you can add to make this work a little better:\n",
|
1311 | 1304 | "\n",
|
1312 | 1305 | "1. **Handle bad tokens**: The model will be generating text. It should\n",
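One common way to handle bad tokens, sketched below, is to add a large negative bias to the logits of tokens the model should never emit; the token names and vocabulary lookup here are assumptions that depend on your tokenizer:

```python
import numpy as np
import tensorflow as tf

def bad_token_bias(vocabulary, banned=('', '[UNK]', '[START]')):
  """Build a (1, 1, vocab_size) bias that zeroes banned tokens' probability."""
  index = {word: i for i, word in enumerate(vocabulary)}
  bias = np.zeros([1, 1, len(vocabulary)], dtype=np.float32)
  for token in banned:
    if token in index:                # token names are tokenizer-specific
      bias[..., index[token]] = -1e9  # softmax(logit - 1e9) is effectively 0
  return tf.constant(bias)

# Applied to the decoder's output logits before sampling:
#   logits = output_layer(x) + bad_token_bias(vocab)
```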
|
|
1484 | 1477 | "1. Flatten the extracted image features, so they can be input to the decoder layers.\n",
|
1485 | 1478 | "2. Look up the token embeddings.\n",
|
1486 | 1479 | "3. Run the stack of `DecoderLayer`s on the image features and text embeddings.\n",
|
1487 | | - "4. Run the output layer to predict the next token at each position.\n",

1488 | | - "\n"

| 1480 | + "4. Run the output layer to predict the next token at each position.\n"
1489 | 1481 | ]
|
1490 | 1482 | },
|
1491 | 1483 | {
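Put together, those four steps map onto a `call` method roughly like the sketch below; the attribute names `seq_embedding`, `decoder_layers`, and `output_layer`, and the tuple input convention, are assumptions:

```python
import tensorflow as tf

class Captioner(tf.keras.Model):
  def __init__(self, seq_embedding, decoder_layers, output_layer):
    super().__init__()
    self.seq_embedding = seq_embedding
    self.decoder_layers = decoder_layers
    self.output_layer = output_layer

  def call(self, inputs):
    image_features, tokens = inputs

    # 1. Flatten (batch, h, w, channels) feature maps to (batch, h*w, channels),
    #    assuming the channel count is statically known.
    channels = image_features.shape[-1]
    image_features = tf.reshape(
        image_features, [tf.shape(image_features)[0], -1, channels])

    # 2. Look up the token embeddings (with positional encoding).
    txt = self.seq_embedding(tokens)

    # 3. Run the stack of DecoderLayers on the image features and text embeddings.
    for decoder_layer in self.decoder_layers:
      txt = decoder_layer(inputs=(image_features, txt))

    # 4. Run the output layer to predict the next token at each position.
    return self.output_layer(txt)
```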
|
|
2144 | 2136 | "colab": {
|
2145 | 2137 | "collapsed_sections": [],
|
2146 | 2138 | "name": "image_captioning.ipynb",
|
2147 | | - "private_outputs": true,

2148 | | - "provenance": [],
2149 | 2139 | "toc_visible": true
|
2150 | 2140 | },
|
2151 | 2141 | "kernelspec": {
|
|