|
68 | 68 | "id": "FbVhjPpzn6BM"
|
69 | 69 | },
|
70 | 70 | "source": [
|
71 |
| - "This tutorial demonstrates how to use [`tf.distribute.Strategy`](https://www.tensorflow.org/guide/distributed_training) with custom training loops. We will train a simple CNN model on the fashion MNIST dataset. The fashion MNIST dataset contains 60000 train images of size 28 x 28 and 10000 test images of size 28 x 28.\n", |
| 71 | + "This tutorial demonstrates how to use `tf.distribute.Strategy` — a TensorFlow API that provides an abstraction for [distributing your training](../../guide/distributed_training.ipynb) across multiple processing units (GPUs, multiple machines, or TPUs) — with custom training loops. In this example, you will train a simple convolutional neural network on the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) containing 70,000 images of size 28 x 28.\n", |
72 | 72 | "\n",
|
73 |
| - "We are using custom training loops to train our model because they give us flexibility and a greater control on training. Moreover, it is easier to debug the model and the training loop." |
| 73 | + "[Custom training loops](../customization/custom_training_walkthrough.ipynb) provide flexibility and a greater control on training. They also make it is easier to debug the model and the training loop." |
74 | 74 | ]
|
75 | 75 | },
|
76 | 76 | {
|
|
97 | 97 | "id": "MM6W__qraV55"
|
98 | 98 | },
|
99 | 99 | "source": [
|
100 |
| - "## Download the fashion MNIST dataset" |
| 100 | + "## Download the Fashion MNIST dataset" |
101 | 101 | ]
|
102 | 102 | },
|
103 | 103 | {
|
|
112 | 112 | "\n",
|
113 | 113 | "(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()\n",
|
114 | 114 | "\n",
|
115 |
| - "# Adding a dimension to the array -> new shape == (28, 28, 1)\n", |
116 |
| - "# We are doing this because the first layer in our model is a convolutional\n", |
| 115 | + "# Add a dimension to the array -> new shape == (28, 28, 1)\n", |
| 116 | + "# This is done because the first layer in our model is a convolutional\n", |
117 | 117 | "# layer and it requires a 4D input (batch_size, height, width, channels).\n",
|
118 | 118 | "# batch_size dimension will be added later on.\n",
|
119 | 119 | "train_images = train_images[..., None]\n",
|
120 | 120 | "test_images = test_images[..., None]\n",
|
121 | 121 | "\n",
|
122 |
| - "# Getting the images in [0, 1] range.\n", |
| 122 | + "# Scale the images to the [0, 1] range.\n", |
123 | 123 | "train_images = train_images / np.float32(255)\n",
|
124 | 124 | "test_images = test_images / np.float32(255)"
|
125 | 125 | ]
|
|
141 | 141 | "source": [
|
142 | 142 | "How does `tf.distribute.MirroredStrategy` strategy work?\n",
|
143 | 143 | "\n",
|
144 |
| - "* All the variables and the model graph is replicated on the replicas.\n", |
| 144 | + "* All the variables and the model graph are replicated across the replicas.\n", |
145 | 145 | "* Input is evenly distributed across the replicas.\n",
|
146 | 146 | "* Each replica calculates the loss and gradients for the input it received.\n",
|
147 | 147 | "* The gradients are synced across all the replicas by summing them.\n",
|
148 | 148 | "* After the sync, the same update is made to the copies of the variables on each replica.\n",
|
149 | 149 | "\n",
|
150 |
| - "Note: You can put all the code below inside a single scope. We are dividing it into several code cells for illustration purposes.\n" |
| 150 | + "Note: You can put all the code below inside a single scope. This example divides it into several code cells for illustration purposes.\n" |
151 | 151 | ]
|
152 | 152 | },
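As a quick illustration of the replication described above, here is a minimal sketch (separate from the notebook's own cells) showing that a variable created inside `strategy.scope()` is mirrored, that is, one synchronized copy is kept per replica:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Variables created under the scope are replicated on each device managed by
# the strategy and kept in sync during training.
with strategy.scope():
  v = tf.Variable(1.0)

print(type(v).__name__)  # a mirrored/distributed variable type rather than a plain Variable
```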
|
153 | 153 | {
|
|
158 | 158 | },
|
159 | 159 | "outputs": [],
|
160 | 160 | "source": [
|
161 |
| - "# If the list of devices is not specified in the\n", |
162 |
| - "# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.\n", |
| 161 | + "# If the list of devices is not specified in\n", |
| 162 | + "# `tf.distribute.MirroredStrategy` constructor, they will be auto-detected.\n", |
163 | 163 | "strategy = tf.distribute.MirroredStrategy()"
|
164 | 164 | ]
|
165 | 165 | },
|
|
171 | 171 | },
|
172 | 172 | "outputs": [],
|
173 | 173 | "source": [
|
174 |
| - "print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))" |
| 174 | + "print('Number of devices: {}'.format(strategy.num_replicas_in_sync))" |
175 | 175 | ]
|
176 | 176 | },
|
177 | 177 | {
|
|
183 | 183 | "## Setup input pipeline"
|
184 | 184 | ]
|
185 | 185 | },
|
186 |
| - { |
187 |
| - "cell_type": "markdown", |
188 |
| - "metadata": { |
189 |
| - "id": "0Qb6nDgxiN_n" |
190 |
| - }, |
191 |
| - "source": [ |
192 |
| - "Export the graph and the variables to the platform-agnostic SavedModel format. After your model is saved, you can load it with or without the scope." |
193 |
| - ] |
194 |
| - }, |
195 | 186 | {
|
196 | 187 | "cell_type": "code",
|
197 | 188 | "execution_count": null,
|
|
240 | 231 | "source": [
|
241 | 232 | "## Create the model\n",
|
242 | 233 | "\n",
|
243 |
| - "Create a model using `tf.keras.Sequential`. You can also use the Model Subclassing API to do this." |
| 234 | + "Create a model using `tf.keras.Sequential`. You can also use the [Model Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) or the [functional API](https://www.tensorflow.org/guide/keras/functional) to do this." |
244 | 235 | ]
|
245 | 236 | },
|
246 | 237 | {
|
|
286 | 277 | "source": [
|
287 | 278 | "## Define the loss function\n",
|
288 | 279 | "\n",
|
289 |
| - "Normally, on a single machine with 1 GPU/CPU, loss is divided by the number of examples in the batch of input.\n", |
| 280 | + "Normally, on a single machine with single GPU/CPU, loss is divided by the number of examples in the batch of input.\n", |
290 | 281 | "\n",
|
291 | 282 | "*So, how should the loss be calculated when using a `tf.distribute.Strategy`?*\n",
|
292 | 283 | "\n",
|
293 | 284 | "* For an example, let's say you have 4 GPU's and a batch size of 64. One batch of input is distributed\n",
|
294 | 285 | "across the replicas (4 GPUs), each replica getting an input of size 16.\n",
|
295 | 286 | "\n",
|
296 |
| - "* The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (BATCH_SIZE_PER_REPLICA = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64)." |
| 287 | + "* The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (`BATCH_SIZE_PER_REPLICA` = 16), the loss should be divided by the `GLOBAL_BATCH_SIZE` (64)." |
297 | 288 | ]
|
298 | 289 | },
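To make the arithmetic above concrete, here is a small self-contained sketch with made-up loss values (not the tutorial's model): each replica divides its summed loss by the global batch size of 64 rather than by its local batch size of 16, so summing the per-replica results recovers the mean over all 64 examples.

```python
import numpy as np

GLOBAL_BATCH_SIZE = 64
NUM_REPLICAS = 4

# Hypothetical per-example losses: 4 replicas, 16 examples each.
per_replica_losses = [np.random.rand(GLOBAL_BATCH_SIZE // NUM_REPLICAS)
                      for _ in range(NUM_REPLICAS)]

# Each replica scales by the GLOBAL batch size (64), not its local size (16)...
per_replica_scaled = [loss.sum() / GLOBAL_BATCH_SIZE for loss in per_replica_losses]

# ...so the sum across replicas equals the mean over all 64 examples.
np.testing.assert_allclose(sum(per_replica_scaled),
                           np.concatenate(per_replica_losses).mean())
```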
|
299 | 290 | {
|
|
315 | 306 | "source": [
|
316 | 307 | "*How to do this in TensorFlow?*\n",
|
317 | 308 | "\n",
|
318 |
| - "* If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the GLOBAL_BATCH_SIZE: \n", |
| 309 | + "* If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the `GLOBAL_BATCH_SIZE`: \n", |
319 | 310 | "`scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE)`\n",
|
320 | 311 | "or you can use `tf.nn.compute_average_loss` which takes the per example loss,\n",
|
321 |
| - "optional sample weights, and GLOBAL_BATCH_SIZE as arguments and returns the scaled loss.\n", |
| 312 | + "optional sample weights, and `GLOBAL_BATCH_SIZE` as arguments and returns the scaled loss.\n", |
322 | 313 | "\n",
|
323 | 314 | "* If you are using regularization losses in your model then you need to scale\n",
|
324 | 315 | "the loss value by number of replicas. You can do this by using the `tf.nn.scale_regularization_loss` function.\n",
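As a rough sketch of the two options described above (it borrows `loss_object`, `labels`, `predictions`, and `GLOBAL_BATCH_SIZE` from this tutorial rather than defining them), manual scaling and `tf.nn.compute_average_loss` produce the same scaled loss on a replica:

```python
# One value per example, because the loss object is built with reduction=NONE (see below).
per_example_loss = loss_object(labels, predictions)

# Option 1: sum the per-example losses and divide by the GLOBAL batch size.
scaled_loss = tf.reduce_sum(per_example_loss) * (1. / GLOBAL_BATCH_SIZE)

# Option 2: let tf.nn.compute_average_loss apply the same scaling.
scaled_loss = tf.nn.compute_average_loss(per_example_loss,
                                         global_batch_size=GLOBAL_BATCH_SIZE)
```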
|
|
351 | 342 | "outputs": [],
|
352 | 343 | "source": [
|
353 | 344 | "with strategy.scope():\n",
|
354 |
| - " # Set reduction to `none` so we can do the reduction afterwards and divide by\n", |
| 345 | + " # Set reduction to `NONE` so you can do the reduction afterwards and divide by\n", |
355 | 346 | " # global batch size.\n",
|
356 | 347 | " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n",
|
357 | 348 | " from_logits=True,\n",
|
|
484 | 475 | "\n",
|
485 | 476 | " template = (\"Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, \"\n",
|
486 | 477 | " \"Test Accuracy: {}\")\n",
|
487 |
| - " print (template.format(epoch+1, train_loss,\n", |
488 |
| - " train_accuracy.result()*100, test_loss.result(),\n", |
489 |
| - " test_accuracy.result()*100))\n", |
| 478 | + " print(template.format(epoch + 1, train_loss,\n", |
| 479 | + " train_accuracy.result() * 100, test_loss.result(),\n", |
| 480 | + " test_accuracy.result() * 100))\n", |
490 | 481 | "\n",
|
491 | 482 | " test_loss.reset_states()\n",
|
492 | 483 | " train_accuracy.reset_states()\n",
|
|
501 | 492 | "source": [
|
502 | 493 | "Things to note in the example above:\n",
|
503 | 494 | "\n",
|
504 |
| - "* We are iterating over the `train_dist_dataset` and `test_dist_dataset` using a `for x in ...` construct.\n", |
| 495 | + "* Iterate over the `train_dist_dataset` and `test_dist_dataset` using a `for x in ...` construct.\n", |
505 | 496 | "* The scaled loss is the return value of the `distributed_train_step`. This value is aggregated across replicas using the `tf.distribute.Strategy.reduce` call and then across batches by summing the return value of the `tf.distribute.Strategy.reduce` calls.\n",
|
506 | 497 | "* `tf.keras.Metrics` should be updated inside `train_step` and `test_step` that gets executed by `tf.distribute.Strategy.run`.\n",
|
507 | 498 | "*`tf.distribute.Strategy.run` returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can do `tf.distribute.Strategy.reduce` to get an aggregated value. You can also do `tf.distribute.Strategy.experimental_local_results` to get the list of values contained in the result, one per local replica.\n"
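For reference, here is a condensed sketch of the run-and-reduce pattern those notes describe (it assumes `strategy` and `train_step` are defined as in the cells above): `tf.distribute.Strategy.run` returns one result per replica, and `tf.distribute.Strategy.reduce` aggregates them into a single value.

```python
@tf.function
def distributed_train_step(dataset_inputs):
  # Run the training step on every replica; the result holds one loss per replica.
  per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
  # Summing is the right reduction here because each replica already divided
  # its loss by the global batch size.
  return strategy.reduce(tf.distribute.ReduceOp.SUM,
                         per_replica_losses, axis=None)
```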
|
|
570 | 561 | "for images, labels in test_dataset:\n",
|
571 | 562 | " eval_step(images, labels)\n",
|
572 | 563 | "\n",
|
573 |
| - "print ('Accuracy after restoring the saved model without strategy: {}'.format(\n", |
574 |
| - " eval_accuracy.result()*100))" |
| 564 | + "print('Accuracy after restoring the saved model without strategy: {}'.format(\n", |
| 565 | + " eval_accuracy.result() * 100))" |
575 | 566 | ]
|
576 | 567 | },
|
577 | 568 | {
|
|
606 | 597 | " average_train_loss = total_loss / num_batches\n",
|
607 | 598 | "\n",
|
608 | 599 | " template = (\"Epoch {}, Loss: {}, Accuracy: {}\")\n",
|
609 |
| - " print (template.format(epoch+1, average_train_loss, train_accuracy.result()*100))\n", |
| 600 | + " print(template.format(epoch + 1, average_train_loss, train_accuracy.result() * 100))\n", |
610 | 601 | " train_accuracy.reset_states()"
|
611 | 602 | ]
|
612 | 603 | },
|
|
617 | 608 | },
|
618 | 609 | "source": [
|
619 | 610 | "### Iterating inside a tf.function\n",
|
620 |
| - "You can also iterate over the entire input `train_dist_dataset` inside a tf.function using the `for x in ...` construct or by creating iterators like we did above. The example below demonstrates wrapping one epoch of training in a tf.function and iterating over `train_dist_dataset` inside the function." |
| 611 | + "You can also iterate over the entire input `train_dist_dataset` inside a `tf.function` using the `for x in ...` construct or by creating iterators like you did above. The example below demonstrates wrapping one epoch of training with a `@tf.function` decorator and iterating over `train_dist_dataset` inside the function." |
621 | 612 | ]
|
622 | 613 | },
|
623 | 614 | {
|
|
643 | 634 | " train_loss = distributed_train_epoch(train_dist_dataset)\n",
|
644 | 635 | "\n",
|
645 | 636 | " template = (\"Epoch {}, Loss: {}, Accuracy: {}\")\n",
|
646 |
| - " print (template.format(epoch+1, train_loss, train_accuracy.result()*100))\n", |
| 637 | + " print(template.format(epoch + 1, train_loss, train_accuracy.result() * 100))\n", |
647 | 638 | "\n",
|
648 | 639 | " train_accuracy.reset_states()"
|
649 | 640 | ]
|
|
658 | 649 | "\n",
|
659 | 650 | "Note: As a general rule, you should use `tf.keras.Metrics` to track per-sample values and avoid values that have been aggregated within a replica.\n",
|
660 | 651 | "\n",
|
661 |
| - "We do *not* recommend using `tf.metrics.Mean` to track the training loss across different replicas, because of the loss scaling computation that is carried out.\n", |
| 652 | + "Because of the loss scaling computation that is carried out, it's not recommended to use `tf.metrics.Mean` to track the training loss across different replicas.\n", |
662 | 653 | "\n",
|
663 | 654 | "For example, if you run a training job with the following characteristics:\n",
|
664 | 655 | "* Two replicas\n",
|
|
699 | 690 | "## Next steps\n",
|
700 | 691 | "\n",
|
701 | 692 | "* Try out the new `tf.distribute.Strategy` API on your models.\n",
|
702 |
| - "* Visit the [Performance section](../../guide/function.ipynb) in the guide to learn more about other strategies and [tools](../../guide/profiler.md) you can use to optimize the performance of your TensorFlow models." |
| 693 | + "* Visit the [Better performance with tf.function](../../guide/function.ipynb) and [TensorFlow Profiler](../../guide/profiler.md) guide to learn more about tools to optimize the performance of your TensorFlow models.\n", |
| 694 | + "* The [Distributed training in TensorFlow](../../guide/distributed_training.ipynb) guide provides an overview of the available distribution strategies." |
703 | 695 | ]
|
704 | 696 | }
|
705 | 697 | ],
|
|