@@ -2,6 +2,8 @@ Writing Distributed Applications with PyTorch
 =============================================
 **Author**: `Séb Arnold <https://seba1511.com>`_
 
+**Edited by**: `Chirag Pandya <https://github.com/c-p-i-o>`_
+
 .. note::
    |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_tuto.rst>`__.
 
@@ -38,7 +40,7 @@ simultaneously. If you have access to compute cluster you should check
 with your local sysadmin or use your favorite coordination tool (e.g.,
 `pdsh <https://linux.die.net/man/1/pdsh>`__,
 `clustershell <https://cea-hpc.github.io/clustershell/>`__, or
-`others <https://slurm.schedmd.com/>`__). For the purpose of this
+`slurm <https://slurm.schedmd.com/>`__). For the purpose of this
 tutorial, we will use a single machine and spawn multiple processes using
 the following template.
 
@@ -64,11 +66,11 @@ the following template.
 
 
     if __name__ == "__main__":
-        size = 2
+        world_size = 2
         processes = []
         mp.set_start_method("spawn")
-        for rank in range(size):
-            p = mp.Process(target=init_process, args=(rank, size, run))
+        for rank in range(world_size):
+            p = mp.Process(target=init_process, args=(rank, world_size, run))
             p.start()
             processes.append(p)
 
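As a side note, the same spawning logic can also be written with ``torch.multiprocessing.spawn``, which passes the process index (the rank) as the first argument to its target. A minimal sketch, assuming the ``init_process`` and ``run`` functions defined in the template:

.. code:: python

    import torch.multiprocessing as mp

    if __name__ == "__main__":
        world_size = 2
        # mp.spawn calls init_process(rank, world_size, run) once per process
        # and, with join=True, waits for all of them to finish.
        mp.spawn(init_process, args=(world_size, run), nprocs=world_size, join=True)
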
@@ -125,7 +127,7 @@ process 0 increments the tensor and sends it to process 1 so that they
 both end up with 1.0. Notice that process 1 needs to allocate memory in
 order to store the data it will receive.
 
-Also notice that ``send``/``recv`` are **blocking**: both processes stop
+Also notice that ``send/recv`` are **blocking**: both processes block
 until the communication is completed. On the other hand immediates are
 **non-blocking**; the script continues its execution and the methods
 return a ``Work`` object upon which we can choose to
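
As a quick illustration of that difference, here is a minimal sketch (reusing the ``run``/``init_process`` template from above) in which rank 0 sends a tensor to rank 1 with the non-blocking immediates, and both ranks wait on the returned ``Work`` object before touching the data:

.. code:: python

    import torch
    import torch.distributed as dist

    def run(rank, size):
        tensor = torch.zeros(1)
        if rank == 0:
            tensor += 1
            req = dist.isend(tensor=tensor, dst=1)  # returns immediately with a Work object
        else:
            req = dist.irecv(tensor=tensor, src=0)  # returns immediately with a Work object
        req.wait()  # block until the transfer completes; only then is `tensor` safe to read
        print(f"Rank {rank} has data {tensor[0]}")
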
@@ -219,16 +221,23 @@ to obtain the sum of all tensors on all processes, we can use the
 Since we want the sum of all tensors in the group, we use
 ``dist.ReduceOp.SUM`` as the reduce operator. Generally speaking, any
 commutative mathematical operation can be used as an operator.
-Out-of-the-box, PyTorch comes with 4 such operators, all working at the
+Out-of-the-box, PyTorch comes with many such operators, all working at the
 element-wise level:
 
 - ``dist.ReduceOp.SUM``,
 - ``dist.ReduceOp.PRODUCT``,
 - ``dist.ReduceOp.MAX``,
-- ``dist.ReduceOp.MIN``.
+- ``dist.ReduceOp.MIN``,
+- ``dist.ReduceOp.BAND``,
+- ``dist.ReduceOp.BOR``,
+- ``dist.ReduceOp.BXOR``,
+- ``dist.ReduceOp.PREMUL_SUM``.
+
+The full list of supported operators is available
+`here <https://pytorch.org/docs/stable/distributed.html#torch.distributed.ReduceOp>`__.
 
-In addition to ``dist.all_reduce(tensor, op, group)``, there are a total
-of 6 collectives currently implemented in PyTorch.
+In addition to ``dist.all_reduce(tensor, op, group)``, there are many more collectives currently implemented in
+PyTorch. Here are a few of the supported collectives.
 
 - ``dist.broadcast(tensor, src, group)``: Copies ``tensor`` from
   ``src`` to all other processes.
@@ -244,6 +253,12 @@ of 6 collectives currently implemented in PyTorch.
 - ``dist.all_gather(tensor_list, tensor, group)``: Copies ``tensor``
   from all processes to ``tensor_list``, on all processes.
 - ``dist.barrier(group)``: Blocks all processes in `group` until each one has entered this function.
+- ``dist.all_to_all(output_tensor_list, input_tensor_list, group)``: Scatters the list of input tensors to all processes in
+  a group and returns the gathered list of tensors in the output list.
+
+The full list of supported collectives is available in the latest
+`PyTorch Distributed documentation <https://pytorch.org/docs/stable/distributed.html>`__.
+
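As a small, hedged sketch of how these operators compose with ``all_reduce`` (reusing the ``run``/``init_process`` template from earlier; the values and group are illustrative):

.. code:: python

    import torch
    import torch.distributed as dist

    def run(rank, size):
        group = dist.new_group(list(range(size)))
        tensor = torch.tensor([rank + 1.0])
        # Element-wise sum across the group: with world_size == 2, both ranks end up with 3.0.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        print(f"Rank {rank} after SUM: {tensor[0]}")

        tensor = torch.tensor([float(rank)])
        # Element-wise maximum: both ranks end up with the largest contribution, 1.0.
        dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
        print(f"Rank {rank} after MAX: {tensor[0]}")

Note that backend support varies by operator (for example, ``PREMUL_SUM`` is only available with the NCCL backend), so check the ``ReduceOp`` documentation linked above for your backend.
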
 
 Distributed Training
 --------------------
@@ -275,7 +290,7 @@ gradients of their model on their batch of data and then average their
 gradients. In order to ensure similar convergence results when changing
 the number of processes, we will first have to partition our dataset.
 (You could also use
-`tnt.dataset.SplitDataset <https://github.com/pytorch/tnt/blob/master/torchnet/dataset/splitdataset.py#L4>`__,
+`torch.utils.data.random_split <https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split>`__,
 instead of the snippet below.)
 
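As a hedged illustration of that ``random_split`` alternative (the dataset, shard sizes, and seed are illustrative; the tutorial's own partitioning snippet follows right after):

.. code:: python

    import torch
    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    def shard_dataset(rank, size):
        dataset = datasets.MNIST('./data', train=True, download=True,
                                 transform=transforms.ToTensor())
        lengths = [len(dataset) // size] * size
        lengths[-1] += len(dataset) - sum(lengths)  # give any remainder to the last shard
        # The same seed on every rank yields identical splits, so each rank can pick its own shard.
        shards = random_split(dataset, lengths,
                              generator=torch.Generator().manual_seed(1234))
        return shards[rank]
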
 .. code:: python
@@ -389,7 +404,7 @@ could train any model on a large computer cluster.
 lot more tricks <https://seba-1511.github.io/dist_blog>`__ required to
 implement a production-level version of synchronous SGD. Again,
 use what `has been tested and
-optimized <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
+optimized <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.
 
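For reference, a minimal sketch of what the tested-and-optimized path looks like (assuming the process group has already been initialized as in the template above; the model is a placeholder):

.. code:: python

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    model = nn.Linear(10, 10)   # placeholder for any nn.Module
    ddp_model = DDP(model)      # on CPU/Gloo; pass device_ids=[local_rank] for one GPU per process
    # The training loop is unchanged: forward, loss.backward(), optimizer.step().
    # DDP overlaps gradient averaging with the backward pass for you.
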
 Our Own Ring-Allreduce
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -451,8 +466,9 @@ Communication Backends
 
 One of the most elegant aspects of ``torch.distributed`` is its ability
 to abstract and build on top of different backends. As mentioned before,
-there are currently three backends implemented in PyTorch: Gloo, NCCL, and
-MPI. They each have different specifications and tradeoffs, depending
+there are multiple backends implemented in PyTorch.
+The most commonly used are Gloo, NCCL, and MPI.
+Each has different specifications and tradeoffs, depending
 on the desired use case. A comparative table of supported functions can
 be found
 `here <https://pytorch.org/docs/stable/distributed.html#module-torch.distributed>`__.
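
Switching between them is, at the API level, mostly a one-argument change at initialization time; a hedged sketch (``rank`` and ``size`` come from the earlier template):

.. code:: python

    import torch.distributed as dist

    # Gloo: works out of the box and supports CPU tensors.
    dist.init_process_group(backend="gloo", rank=rank, world_size=size)

    # NCCL: shipped in the CUDA builds, intended for CUDA tensors.
    # dist.init_process_group(backend="nccl", rank=rank, world_size=size)

    # MPI: requires building PyTorch from source against an MPI installation
    # and is typically launched with mpirun, which supplies the rank and world size.
    # dist.init_process_group(backend="mpi")
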
@@ -544,15 +560,15 @@ NCCL backend is included in the pre-built binaries with CUDA support.
 Initialization Methods
 ~~~~~~~~~~~~~~~~~~~~~~
 
-To finish this tutorial, let's talk about the very first function we
-called: ``dist.init_process_group(backend, init_method)``. In
-particular, we will go over the different initialization methods which
-are responsible for the initial coordination step between each process.
-Those methods allow you to define how this coordination is done.
-Depending on your hardware setup, one of these methods should be
-naturally more suitable than the others. In addition to the following
-sections, you should also have a look at the `official
-documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__.
+To conclude this tutorial, let's examine the first function we invoked:
+``dist.init_process_group(backend, init_method)``. Specifically, we will discuss the various
+initialization methods responsible for the preliminary coordination step between each process.
+These methods enable you to define how this coordination is accomplished.
+
+The choice of initialization method depends on your hardware setup, and one method may be more
+suitable than others. In addition to the following sections, please refer to the `official
+documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__ for further information.
+
 
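To make the options concrete, here is a hedged sketch of passing an explicit ``init_method`` (``rank`` and ``size`` as in the template above; the address, port, and file path are placeholders):

.. code:: python

    import torch.distributed as dist

    # TCP rendezvous: every process connects to the address of rank 0.
    dist.init_process_group(backend="gloo",
                            init_method="tcp://10.1.1.20:23456",
                            rank=rank, world_size=size)

    # Shared file system rendezvous: all processes must see the same path.
    # dist.init_process_group(backend="gloo",
    #                         init_method="file:///mnt/nfs/sharedfile",
    #                         rank=rank, world_size=size)

    # Environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE):
    # this is also the default when init_method is omitted.
    # dist.init_process_group(backend="gloo", init_method="env://")
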
 **Environment Variable**
 