|
194 | 194 | "id": "fLW6D2TzvC-4"
|
195 | 195 | },
|
196 | 196 | "source": [
|
197 |
| - "Next, create an `mnist.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:" |
| 197 | + "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:" |
198 | 198 | ]
|
199 | 199 | },
|
200 | 200 | {
|
|
205 | 205 | },
|
206 | 206 | "outputs": [],
|
207 | 207 | "source": [
|
208 |
| - "%%writefile mnist.py\n", |
| 208 | + "%%writefile mnist_setup.py\n", |
209 | 209 | "\n",
|
210 | 210 | "import os\n",
|
211 | 211 | "import tensorflow as tf\n",
|
|
256 | 256 | },
|
257 | 257 | "outputs": [],
|
258 | 258 | "source": [
|
259 |
| - "import mnist\n", |
| 259 | + "import mnist_setup\n", |
260 | 260 | "\n",
|
261 | 261 | "batch_size = 64\n",
|
262 |
| - "single_worker_dataset = mnist.mnist_dataset(batch_size)\n", |
263 |
| - "single_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 262 | + "single_worker_dataset = mnist_setup.mnist_dataset(batch_size)\n", |
| 263 | + "single_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
264 | 264 | "single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)"
|
265 | 265 | ]
|
266 | 266 | },
|
|
439 | 439 | "\n",
|
440 | 440 | "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
|
441 | 441 | "\n",
|
442 |
| - "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
| 442 | + "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
443 | 443 | ]
|
444 | 444 | },
|
445 | 445 | {
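For reference alongside the paragraph above, here is a minimal sketch (editorial, not part of this diff) of how the strategy is typically instantiated and inspected; it mirrors the cells shown further down in the notebook:

import tensorflow as tf

# Creating the strategy sets up the collective-communication machinery.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Each worker holds a replica of the model's variables; this reports how many
# replicas take part in the synchronous gradient aggregation.
print('Replicas in sync:', strategy.num_replicas_in_sync)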
|
|
459 | 459 | "id": "N0iv7SyyAohc"
|
460 | 460 | },
|
461 | 461 | "source": [
|
462 |
| - "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy()` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training." |
| 462 | + "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training." |
463 | 463 | ]
|
464 | 464 | },
|
465 | 465 | {
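A small sketch of the ordering this note describes (editorial, not part of this diff): set `TF_CONFIG` first, then create the strategy. The two-worker cluster and ports below are illustrative only.

import json
import os
import tensorflow as tf

# TF_CONFIG must be in the environment *before* the strategy is constructed,
# because the strategy parses it and starts the gRPC servers at that point.
tf_config = {
    'cluster': {'worker': ['localhost:12345', 'localhost:23456']},
    'task': {'type': 'worker', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)

strategy = tf.distribute.MultiWorkerMirroredStrategy()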
|
|
468 | 468 | "id": "FMy2VM4Akzpr"
|
469 | 469 | },
|
470 | 470 | "source": [
|
471 |
| - "`MultiWorkerMirroredStrategy` provides multiple implementations via the [`CommunicationOptions`](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/CommunicationOptions) parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster." |
| 471 | + "`MultiWorkerMirroredStrategy` provides multiple implementations via the `tf.distribute.experimental.CommunicationOptions` parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster." |
472 | 472 | ]
|
473 | 473 | },
|
474 | 474 | {
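As a sketch of choosing an implementation explicitly (editorial, not part of this diff; API names as documented for `tf.distribute.experimental.CommunicationOptions`, and whether `NCCL` is the right choice depends on the GPUs and interconnect, as the paragraph notes):

import tensorflow as tf

# Request NCCL-based collectives; RING or AUTO may suit other clusters better.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options)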
|
|
492 | 492 | "source": [
|
493 | 493 | "with strategy.scope():\n",
|
494 | 494 | " # Model building/compiling need to be within `strategy.scope()`.\n",
|
495 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()" |
| 495 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()" |
496 | 496 | ]
|
497 | 497 | },
|
498 | 498 | {
|
|
512 | 512 | "source": [
|
513 | 513 | "To actually run with `MultiWorkerMirroredStrategy` you'll need to run worker processes and pass a `TF_CONFIG` to them.\n",
|
514 | 514 | "\n",
|
515 |
| - "Like the `mnist.py` file written earlier, here is the `main.py` that each of the workers will run:" |
| 515 | + "Like the `mnist_setup.py` file written earlier, here is the `main.py` that each of the workers will run:" |
516 | 516 | ]
|
517 | 517 | },
|
518 | 518 | {
|
|
529 | 529 | "import json\n",
|
530 | 530 | "\n",
|
531 | 531 | "import tensorflow as tf\n",
|
532 |
| - "import mnist\n", |
| 532 | + "import mnist_setup\n", |
533 | 533 | "\n",
|
534 | 534 | "per_worker_batch_size = 64\n",
|
535 | 535 | "tf_config = json.loads(os.environ['TF_CONFIG'])\n",
|
|
538 | 538 | "strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
|
539 | 539 | "\n",
|
540 | 540 | "global_batch_size = per_worker_batch_size * num_workers\n",
|
541 |
| - "multi_worker_dataset = mnist.mnist_dataset(global_batch_size)\n", |
| 541 | + "multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)\n", |
542 | 542 | "\n",
|
543 | 543 | "with strategy.scope():\n",
|
544 | 544 | " # Model building/compiling need to be within `strategy.scope()`.\n",
|
545 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 545 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
546 | 546 | "\n",
|
547 | 547 | "\n",
|
548 | 548 | "multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)"
|
|
820 | 820 | "options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF\n",
|
821 | 821 | "\n",
|
822 | 822 | "global_batch_size = 64\n",
|
823 |
| - "multi_worker_dataset = mnist.mnist_dataset(batch_size=64)\n", |
| 823 | + "multi_worker_dataset = mnist_setup.mnist_dataset(batch_size=64)\n", |
824 | 824 | "dataset_no_auto_shard = multi_worker_dataset.with_options(options)"
|
825 | 825 | ]
|
826 | 826 | },
|
|
882 | 882 | "\n",
|
883 | 883 | "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
|
884 | 884 | "\n",
|
885 |
| - "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback." |
| 885 | + "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds the support to single-worker training for a consistent experience, and removed the fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback." |
886 | 886 | ]
|
887 | 887 | },
|
888 | 888 | {
|
|
1129 | 1129 | "\n",
|
1130 | 1130 | "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
|
1131 | 1131 | "\n",
|
1132 |
| - "Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n", |
1133 |
| - "Below are two examples for both multi-worker training and single worker training." |
| 1132 | + "Currently, the `BackupAndRestore` callback supports single-worker training with no strategy—`MirroredStrategy`—and multi-worker training with `MultiWorkerMirroredStrategy`.\n", |
| 1133 | + "\n", |
| 1134 | + "Below are two examples for both multi-worker training and single-worker training:" |
1134 | 1135 | ]
|
1135 | 1136 | },
|
1136 | 1137 | {
|
|
1141 | 1142 | },
|
1142 | 1143 | "outputs": [],
|
1143 | 1144 | "source": [
|
1144 |
| - "# Multi-worker training with MultiWorkerMirroredStrategy\n", |
1145 |
| - "# and the BackupAndRestore callback.\n", |
| 1145 | + "# Multi-worker training with `MultiWorkerMirroredStrategy`\n", |
| 1146 | + "# and the `BackupAndRestore` callback.\n", |
1146 | 1147 | "\n",
|
1147 | 1148 | "callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
|
1148 | 1149 | "with strategy.scope():\n",
|
1149 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 1150 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
1150 | 1151 | "multi_worker_model.fit(multi_worker_dataset,\n",
|
1151 | 1152 | " epochs=3,\n",
|
1152 | 1153 | " steps_per_epoch=70,\n",
|
|