
Commit a1c5b3c

Merge pull request #2002 from Obliman:master
PiperOrigin-RevId: 426301644
2 parents 644d622 + 02b9b04 commit a1c5b3c

File tree

1 file changed (+21, -20 lines)


site/en/tutorials/distribute/multi_worker_with_keras.ipynb

Lines changed: 21 additions & 20 deletions
@@ -194,7 +194,7 @@
  "id": "fLW6D2TzvC-4"
  },
  "source": [
- "Next, create an `mnist.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:"
+ "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:"
  ]
  },
  {
@@ -205,7 +205,7 @@
  },
  "outputs": [],
  "source": [
- "%%writefile mnist.py\n",
+ "%%writefile mnist_setup.py\n",
  "\n",
  "import os\n",
  "import tensorflow as tf\n",
@@ -256,11 +256,11 @@
  },
  "outputs": [],
  "source": [
- "import mnist\n",
+ "import mnist_setup\n",
  "\n",
  "batch_size = 64\n",
- "single_worker_dataset = mnist.mnist_dataset(batch_size)\n",
- "single_worker_model = mnist.build_and_compile_cnn_model()\n",
+ "single_worker_dataset = mnist_setup.mnist_dataset(batch_size)\n",
+ "single_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
  "single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)"
  ]
  },
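
For reference, the `mnist_setup.mnist_dataset` and `mnist_setup.build_and_compile_cnn_model` helpers used above come from the `mnist_setup.py` file this commit renames. The sketch below approximates what such a module contains; the exact layer stack and hyperparameters are illustrative, not quoted from the tutorial:

```python
# mnist_setup.py -- approximate sketch of the helpers the notebook relies on.
import numpy as np
import tensorflow as tf

def mnist_dataset(batch_size):
  (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
  # Normalize pixel values to [0, 1] and cast labels to int64.
  x_train = x_train / np.float32(255)
  y_train = y_train.astype(np.int64)
  return (tf.data.Dataset.from_tensor_slices((x_train, y_train))
          .shuffle(60000)
          .repeat()
          .batch(batch_size))

def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.InputLayer(input_shape=(28, 28)),
      tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model
```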
@@ -439,7 +439,7 @@
  "\n",
  "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
  "\n",
- "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy."
+ "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy."
  ]
  },
  {
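
To illustrate the mirroring described in that cell, a minimal sketch of constructing the strategy; with no `TF_CONFIG` set this behaves as single-worker training, as the note in the next hunk explains:

```python
import tensorflow as tf

# Without TF_CONFIG this runs as a single worker; each model variable is
# still mirrored across whatever local devices the strategy discovers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print('Number of replicas in sync:', strategy.num_replicas_in_sync)
```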
@@ -459,7 +459,7 @@
  "id": "N0iv7SyyAohc"
  },
  "source": [
- "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy()` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training."
+ "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training."
  ]
  },
  {
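
As a concrete illustration of that ordering requirement, a hedged sketch of setting `TF_CONFIG` before building the strategy; the two-worker cluster spec and ports are placeholders:

```python
import json
import os
import tensorflow as tf

# A two-worker cluster; this process plays the role of worker 0 (the chief).
tf_config = {
    'cluster': {'worker': ['localhost:12345', 'localhost:23456']},
    'task': {'type': 'worker', 'index': 0}
}

# TF_CONFIG must be in the environment *before* the strategy is constructed,
# because the constructor parses it and starts the gRPC servers.
os.environ['TF_CONFIG'] = json.dumps(tf_config)
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```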
@@ -468,7 +468,7 @@
  "id": "FMy2VM4Akzpr"
  },
  "source": [
- "`MultiWorkerMirroredStrategy` provides multiple implementations via the [`CommunicationOptions`](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/CommunicationOptions) parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster."
+ "`MultiWorkerMirroredStrategy` provides multiple implementations via the `tf.distribute.experimental.CommunicationOptions` parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster."
  ]
  },
  {
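
A short sketch of selecting an implementation explicitly via that parameter; NCCL is chosen here purely as an example, and `RING` or `AUTO` are requested the same way:

```python
import tensorflow as tf

# Request the NCCL collective implementation explicitly.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options)
```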
@@ -492,7 +492,7 @@
  "source": [
  "with strategy.scope():\n",
  " # Model building/compiling need to be within `strategy.scope()`.\n",
- " multi_worker_model = mnist.build_and_compile_cnn_model()"
+ " multi_worker_model = mnist_setup.build_and_compile_cnn_model()"
  ]
  },
  {
@@ -512,7 +512,7 @@
  "source": [
  "To actually run with `MultiWorkerMirroredStrategy` you'll need to run worker processes and pass a `TF_CONFIG` to them.\n",
  "\n",
- "Like the `mnist.py` file written earlier, here is the `main.py` that each of the workers will run:"
+ "Like the `mnist_setup.py` file written earlier, here is the `main.py` that each of the workers will run:"
  ]
  },
  {
@@ -529,7 +529,7 @@
  "import json\n",
  "\n",
  "import tensorflow as tf\n",
- "import mnist\n",
+ "import mnist_setup\n",
  "\n",
  "per_worker_batch_size = 64\n",
  "tf_config = json.loads(os.environ['TF_CONFIG'])\n",
@@ -538,11 +538,11 @@
  "strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
  "\n",
  "global_batch_size = per_worker_batch_size * num_workers\n",
- "multi_worker_dataset = mnist.mnist_dataset(global_batch_size)\n",
+ "multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)\n",
  "\n",
  "with strategy.scope():\n",
  " # Model building/compiling need to be within `strategy.scope()`.\n",
- " multi_worker_model = mnist.build_and_compile_cnn_model()\n",
+ " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
  "\n",
  "\n",
  "multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)"
@@ -820,7 +820,7 @@
  "options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF\n",
  "\n",
  "global_batch_size = 64\n",
- "multi_worker_dataset = mnist.mnist_dataset(batch_size=64)\n",
+ "multi_worker_dataset = mnist_setup.mnist_dataset(batch_size=64)\n",
  "dataset_no_auto_shard = multi_worker_dataset.with_options(options)"
  ]
  },
@@ -882,7 +882,7 @@
  "\n",
  "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
  "\n",
- "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
+ "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds the support to single-worker training for a consistent experience, and removed the fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback."
  ]
  },
  {
@@ -1129,8 +1129,9 @@
  "\n",
  "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
  "\n",
- "Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n",
- "Below are two examples for both multi-worker training and single worker training."
+ "Currently, the `BackupAndRestore` callback supports single-worker training with no strategy—`MirroredStrategy`—and multi-worker training with `MultiWorkerMirroredStrategy`.\n",
+ "\n",
+ "Below are two examples for both multi-worker training and single-worker training:"
  ]
  },
  {
@@ -1141,12 +1142,12 @@
  },
  "outputs": [],
  "source": [
- "# Multi-worker training with MultiWorkerMirroredStrategy\n",
- "# and the BackupAndRestore callback.\n",
+ "# Multi-worker training with `MultiWorkerMirroredStrategy`\n",
+ "# and the `BackupAndRestore` callback.\n",
  "\n",
  "callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
  "with strategy.scope():\n",
- " multi_worker_model = mnist.build_and_compile_cnn_model()\n",
+ " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
  "multi_worker_model.fit(multi_worker_dataset,\n",
  " epochs=3,\n",
  " steps_per_epoch=70,\n",
