@@ -207,6 +207,67 @@ After attaching, the estimator can be deployed as usual.
 
     tf_estimator = TensorFlow.attach(training_job_name=training_job_name)
 
+Distributed Training
+''''''''''''''''''''
+
+To run your training job with multiple instances in a distributed fashion, set ``train_instance_count``
+to a number larger than 1. We support two different types of distributed training: parameter server and Horovod.
+The ``distributions`` parameter is used to configure which distributed training strategy to use.
+
+Training with parameter servers
+"""""""""""""""""""""""""""""""
+
+If you specify ``parameter_server`` as the value of the ``distributions`` parameter, the container launches a parameter server
+thread on each instance in the training cluster, and then executes your training code. You can find more information on
+TensorFlow distributed training in the `TensorFlow docs <https://www.tensorflow.org/deploy/distributed>`__.
+To enable parameter server training:
+
+.. code:: python
+
+    from sagemaker.tensorflow import TensorFlow
+
+    tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
+                              train_instance_count=2, train_instance_type='ml.p2.xlarge',
+                              framework_version='1.11', py_version='py3',
+                              distributions={'parameter_server': {'enabled': True}})
+    tf_estimator.fit('s3://bucket/path/to/training/data')
+
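Under the hood, TensorFlow's distributed runtime discovers the cluster through the ``TF_CONFIG`` environment variable, which is populated for each instance in the cluster. As a rough illustration, a training script can inspect it as shown below; the host names and cluster layout in this sketch are assumptions for illustration, not the exact values SageMaker writes:

```python
import json
import os

# Hypothetical TF_CONFIG for illustration only; in a real training job the
# container sets this value for each instance in the cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["algo-1:2222"],
        "worker": ["algo-2:2222"],
        "ps": ["algo-1:2223", "algo-2:2223"],
    },
    "task": {"type": "worker", "index": 0},
})

# A training script reads the variable back to learn its role in the cluster.
tf_config = json.loads(os.environ["TF_CONFIG"])
num_ps = len(tf_config["cluster"]["ps"])
task_type = tf_config["task"]["type"]
print(num_ps, task_type)  # 2 worker
```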
+Training with Horovod
+"""""""""""""""""""""
+
+Horovod is a distributed training framework based on MPI. You can find more details in the `Horovod README <https://github.com/uber/horovod>`__.
+
+The container sets up the MPI environment and executes the ``mpirun`` command, enabling you to run any Horovod
+training script with Script Mode.
+
+Training with ``MPI`` is configured by specifying the following fields in ``distributions``:
+
+- ``enabled (bool)``: If set to ``True``, the MPI setup is performed and the ``mpirun`` command is executed.
+- ``processes_per_host (int)``: Number of processes MPI should launch on each host. Note that this should not be
+  greater than the available slots on the selected instance type. This flag should be set for multi-CPU/GPU
+  training.
+- ``custom_mpi_options (str)``: Any ``mpirun`` flag(s) passed in this field are added to the ``mpirun``
+  command executed by SageMaker to launch distributed Horovod training.
+
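The total number of Horovod processes, i.e. the MPI world size, is the instance count multiplied by ``processes_per_host``. A quick sketch of that arithmetic:

```python
def horovod_world_size(train_instance_count, processes_per_host):
    """Total number of MPI processes mpirun launches across the cluster."""
    return train_instance_count * processes_per_host

# e.g. 2 instances with 4 processes each -> 8 Horovod workers in total
print(horovod_world_size(2, 4))  # 8
```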
+
+In the example below we create an estimator to launch Horovod distributed training with 2 processes on one host:
+
+.. code:: python
+
+    from sagemaker.tensorflow import TensorFlow
+
+    tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
+                              train_instance_count=1, train_instance_type='ml.p2.xlarge',
+                              framework_version='1.12', py_version='py3',
+                              distributions={
+                                  'mpi': {
+                                      'enabled': True,
+                                      'processes_per_host': 2,
+                                      'custom_mpi_options': '--NCCL_DEBUG INFO'
+                                  }
+                              })
+    tf_estimator.fit('s3://bucket/path/to/training/data')
+
 sagemaker.tensorflow.TensorFlow class
 '''''''''''''''''''''''''''''''''''''
 
@@ -277,11 +338,10 @@ Optional:
 - ``model_dir (str)`` Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training.
   If not specified, an S3 location will be generated under the training job's default bucket, and ``model_dir`` will be
   passed to your training script as one of the command line arguments.
-- ``distributions (dict)`` Configure your distrubtion strategy with this argument. For launching parameter server for
-  for distributed training, you must set ``distributions`` to ``{'parameter_server': {'enabled': True}}``
+- ``distributions (dict)`` Configure your distribution strategy with this argument.
 
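Because ``model_dir`` arrives as a command line argument, a training script typically reads it with ``argparse``. A minimal sketch, with the argument vector passed explicitly so the example is self-contained (in a real script ``parse_known_args()`` would read ``sys.argv``):

```python
import argparse

# Parse the arguments passed to the training script; model_dir is supplied
# automatically, alongside any hyperparameters set on the estimator.
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str)
args, _ = parser.parse_known_args(["--model_dir", "s3://bucket/model"])
print(args.model_dir)  # s3://bucket/model
```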
 Training with Pipe Mode using PipeModeDataset
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Amazon SageMaker allows users to create training jobs using Pipe input mode.
 With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first.
@@ -327,9 +387,9 @@ To run training job with Pipe input mode, pass in ``input_mode='Pipe'`` to your
     from sagemaker.tensorflow import TensorFlow
 
     tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole',
-                        training_steps=10000, evaluation_steps=100,
-                        train_instance_count=1, train_instance_type='ml.p2.xlarge',
-                        framework_version='1.10.0', input_mode='Pipe')
+                              training_steps=10000, evaluation_steps=100,
+                              train_instance_count=1, train_instance_type='ml.p2.xlarge',
+                              framework_version='1.10.0', input_mode='Pipe')
 
     tf_estimator.fit('s3://bucket/path/to/training/data')
 
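To build intuition for why Pipe mode can start training sooner, the sketch below illustrates the streaming idea in plain Python: records are consumed one at a time from a stream rather than after downloading the whole dataset. This is an illustration only, not the actual ``PipeModeDataset`` implementation:

```python
import io

def stream_records(fileobj):
    """Yield records one at a time without loading the whole dataset."""
    for line in fileobj:
        yield line.rstrip(b"\n")

# In Pipe mode the channel behaves like a stream: training can consume the
# first record before the rest of the dataset has arrived.
channel = io.BytesIO(b"record-0\nrecord-1\nrecord-2\n")
first = next(stream_records(channel))
print(first)  # b'record-0'
```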