76 | 76 | "\n",
77 | 77 | "You will use the `tf.keras` APIs to build the model and `Model.fit` for training it. (To learn about distributed training with a custom training loop and the `MirroredStrategy`, check out [this tutorial](custom_training.ipynb).)\n",
78 | 78 | "\n",
79 |    | - "`MirroredStrategy` trains your model on multiple GPUs on a single machine. For _synchronous training on many GPUs on multiple workers_, use the `tf.distribute.MultiWorkerMirroredStrategy` [with the Keras Model.fit](multi_worker_with_keras.ipynb) or [a custom training loop](multi_worker_with_ctl.ipynb). For other options, refer to the [Distributed training guide](../../guide/distributed_training.ipynb).\n",
   | 79 | + "`MirroredStrategy` trains your model on multiple GPUs on a single machine. For _synchronous training on many GPUs on multiple workers_, use the `tf.distribute.MultiWorkerMirroredStrategy` with the [Keras Model.fit](multi_worker_with_keras.ipynb) or [a custom training loop](multi_worker_with_ctl.ipynb). For other options, refer to the [Distributed training guide](../../guide/distributed_training.ipynb).\n",
80 | 80 | "\n",
81 | 81 | "To learn about various other strategies, there is the [Distributed training with TensorFlow](../../guide/distributed_training.ipynb) guide."
82 | 82 | ]

289 | 289 | "id": "1BnQYQTpB3YA"
290 | 290 | },
291 | 291 | "source": [
292 |     | - "Create and compile the Keras model in the context of `Strategy.scope`:"
    | 292 | + "Within the context of `Strategy.scope`, create and compile the model using the Keras API:"
293 | 293 | ]
294 | 294 | },
295 | 295 | {

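The cell edited here creates and compiles the model inside `strategy.scope()`; a minimal sketch of that pattern (the layer sizes and optimizer below are illustrative stand-ins, not taken from this diff):

```python
import tensorflow as tf

# Create the distribution strategy. With a single device available,
# MirroredStrategy still works and simply uses that one device.
strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across the replicas.
with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy'])
```

Only variable creation and `compile` need to happen under the scope; the later `fit` call is written as usual.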
329 | 329 | "id": "YOXO5nvvK3US"
330 | 330 | },
331 | 331 | "source": [
332 |     | - "Define the following `tf.keras.callbacks`:\n",
    | 332 | + "Define the following [Keras Callbacks](https://www.tensorflow.org/guide/keras/train_and_evaluate):\n",
333 | 333 | "\n",
334 | 334 | "- `tf.keras.callbacks.TensorBoard`: writes a log for TensorBoard, which allows you to visualize the graphs.\n",
335 | 335 | "- `tf.keras.callbacks.ModelCheckpoint`: saves the model at a certain frequency, such as after every epoch.\n",
    | 336 | + "- `tf.keras.callbacks.BackupAndRestore`: provides the fault tolerance functionality by backing up the model and the current epoch number. Learn more in the _Fault tolerance_ section of the [Multi-worker training with Keras](multi_worker_with_keras.ipynb) tutorial.\n",
336 | 337 | "- `tf.keras.callbacks.LearningRateScheduler`: schedules the learning rate to change after, for example, every epoch/batch.\n",
337 | 338 | "\n",
338 |     | - "For illustrative purposes, add a custom callback called `PrintLR` to display the *learning rate* in the notebook."
    | 339 | + "For illustrative purposes, add a [custom callback](https://www.tensorflow.org/guide/keras/custom_callback) called `PrintLR` to display the *learning rate* in the notebook.\n",
    | 340 | + "\n",
    | 341 | + "**Note:** Use the `BackupAndRestore` callback instead of `ModelCheckpoint` as the main mechanism to restore the training state upon a restart from a job failure. Since `BackupAndRestore` only supports eager mode, consider using `ModelCheckpoint` in graph mode."
339 | 342 | ]
340 | 343 | },
341 | 344 | {

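The `LearningRateScheduler` entry above takes a schedule function; a hypothetical step-decay schedule (the epoch boundaries and rates here are illustrative, not from this diff) would look like:

```python
# A step-decay schedule: LearningRateScheduler calls this once per epoch,
# passing the epoch index, and uses the returned value as the new rate.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch < 7:
    return 1e-4
  else:
    return 1e-5
```

It would then be registered as `tf.keras.callbacks.LearningRateScheduler(decay)` alongside the other callbacks.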
382 | 385 | "# Define a callback for printing the learning rate at the end of each epoch.\n",
383 | 386 | "class PrintLR(tf.keras.callbacks.Callback):\n",
384 | 387 | "  def on_epoch_end(self, epoch, logs=None):\n",
385 |     | - "    print('\\nLearning rate for epoch {} is {}'.format(epoch + 1,\n",
386 |     | - "                                                      model.optimizer.lr.numpy()))"
    | 388 | + "    print('\\nLearning rate for epoch {} is {}'.format(\n",
    | 389 | + "        epoch + 1, model.optimizer.lr.numpy()))"
387 | 390 | ]
388 | 391 | },
389 | 392 | {

419 | 422 | "id": "6EophnOAB3YD"
420 | 423 | },
421 | 424 | "source": [
422 |     | - "Now, train the model in the usual way by calling `Model.fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not."
    | 425 | + "Now, train the model in the usual way by calling Keras `Model.fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not."
423 | 426 | ]
424 | 427 | },
425 | 428 | {

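The cell this hunk touches calls `Model.fit` as usual; a self-contained sketch with synthetic data (the model and data below are stand-ins for the tutorial's, chosen only to make the snippet runnable):

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer='sgd', loss='mse')

# Synthetic stand-in for the dataset created earlier in the tutorial.
x = np.random.rand(64, 4).astype('float32')
y = np.random.rand(64, 1).astype('float32')

# The fit call is written identically with or without a strategy;
# the strategy transparently shards each batch across replicas.
history = model.fit(x, y, epochs=2, batch_size=16, verbose=0)
```

`history.history['loss']` then holds one entry per epoch, exactly as in single-device training.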
535 | 538 | "id": "Xa87y_A0vRma"
536 | 539 | },
537 | 540 | "source": [
538 |     | - "Export the graph and the variables to the platform-agnostic SavedModel format using `Model.save`. After your model is saved, you can load it with or without the `Strategy.scope`."
    | 541 | + "Export the graph and the variables to the platform-agnostic SavedModel format using Keras `Model.save`. After your model is saved, you can load it with or without the `Strategy.scope`."
539 | 542 | ]
540 | 543 | },
541 | 544 | {

626 | 629 | "\n",
627 | 630 | "More examples that use different distribution strategies with the Keras `Model.fit` API:\n",
628 | 631 | "\n",
629 |     | - "1. The [Solve GLUE tasks using BERT on TPU](https://www.tensorflow.org/text/tutorials/bert_glue) tutorial uses `tf.distribute.MirroredStrategy` for training on GPUs and `tf.distribute.TPUStrategy`—on TPUs.\n",
    | 632 | + "1. The [Solve GLUE tasks using BERT on TPU](https://www.tensorflow.org/text/tutorials/bert_glue) tutorial uses `tf.distribute.MirroredStrategy` for training on GPUs and `tf.distribute.TPUStrategy` on TPUs.\n",
630 | 633 | "1. The [Save and load a model using a distribution strategy](save_and_load.ipynb) tutorial demonstrates how to use the SavedModel APIs with `tf.distribute.Strategy`.\n",
631 | 634 | "1. The [official TensorFlow models](https://github.com/tensorflow/models/tree/master/official) can be configured to run multiple distribution strategies.\n",
632 | 635 | "\n",
