@@ -2,6 +2,8 @@ Writing Distributed Applications with PyTorch
 =============================================
 **Author**: `Séb Arnold <https://seba1511.com>`_
 
+**Edited by**: `Chirag Pandya <https://github.com/c-p-i-o>`_
+
 .. note::
    |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_tuto.rst>`__.
 
@@ -38,7 +40,7 @@ simultaneously. If you have access to compute cluster you should check
 with your local sysadmin or use your favorite coordination tool (e.g.,
 `pdsh <https://linux.die.net/man/1/pdsh>`__,
 `clustershell <https://cea-hpc.github.io/clustershell/>`__, or
-`others <https://slurm.schedmd.com/>`__). For the purpose of this
+`slurm <https://slurm.schedmd.com/>`__). For the purpose of this
 tutorial, we will use a single machine and spawn multiple processes using
 the following template.
 
@@ -64,11 +66,11 @@ the following template.
 
 
     if __name__ == "__main__":
-        size = 2
+        world_size = 2
         processes = []
         mp.set_start_method("spawn")
-        for rank in range(size):
-            p = mp.Process(target=init_process, args=(rank, size, run))
+        for rank in range(world_size):
+            p = mp.Process(target=init_process, args=(rank, world_size, run))
             p.start()
             processes.append(p)
 
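As a side note, the same spawning logic can also be written with ``torch.multiprocessing.spawn``, which passes the process index (the rank) as the first argument to its target. A minimal sketch, assuming the ``init_process`` and ``run`` functions defined in the template:

.. code:: python

    import torch.multiprocessing as mp

    if __name__ == "__main__":
        world_size = 2
        # mp.spawn calls init_process(rank, world_size, run) once per process
        # and, with join=True, waits for all of them to finish.
        mp.spawn(init_process, args=(world_size, run), nprocs=world_size, join=True)
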
@@ -125,7 +127,7 @@ process 0 increments the tensor and sends it to process 1 so that they
 both end up with 1.0. Notice that process 1 needs to allocate memory in
 order to store the data it will receive.
 
-Also notice that ``send``/``recv`` are **blocking**: both processes stop
+Also notice that ``send/recv`` are **blocking**: both processes block
 until the communication is completed. On the other hand immediates are
 **non-blocking**; the script continues its execution and the methods
 return a ``Work`` object upon which we can choose to
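
As a quick illustration of that difference, here is a minimal sketch (reusing the ``run``/``init_process`` template from above) in which rank 0 sends a tensor to rank 1 with the non-blocking immediates, and both ranks wait on the returned ``Work`` object before touching the data:

.. code:: python

    import torch
    import torch.distributed as dist

    def run(rank, size):
        tensor = torch.zeros(1)
        if rank == 0:
            tensor += 1
            req = dist.isend(tensor=tensor, dst=1)  # returns immediately with a Work object
        else:
            req = dist.irecv(tensor=tensor, src=0)  # returns immediately with a Work object
        req.wait()  # block until the transfer completes; only then is `tensor` safe to read
        print(f"Rank {rank} has data {tensor[0]}")
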
@@ -219,16 +221,23 @@ to obtain the sum of all tensors on all processes, we can use the
 Since we want the sum of all tensors in the group, we use
 ``dist.ReduceOp.SUM`` as the reduce operator. Generally speaking, any
 commutative mathematical operation can be used as an operator.
-Out-of-the-box, PyTorch comes with 4 such operators, all working at the
+Out-of-the-box, PyTorch comes with many such operators, all working at the
 element-wise level:
 
 - ``dist.ReduceOp.SUM``,
 - ``dist.ReduceOp.PRODUCT``,
 - ``dist.ReduceOp.MAX``,
-- ``dist.ReduceOp.MIN``.
+- ``dist.ReduceOp.MIN``,
+- ``dist.ReduceOp.BAND``,
+- ``dist.ReduceOp.BOR``,
+- ``dist.ReduceOp.BXOR``,
+- ``dist.ReduceOp.PREMUL_SUM``.
+
+The full list of supported operators is available
+`here <https://pytorch.org/docs/stable/distributed.html#torch.distributed.ReduceOp>`__.
 
-In addition to ``dist.all_reduce(tensor, op, group)``, there are a total
-of 6 collectives currently implemented in PyTorch.
+In addition to ``dist.all_reduce(tensor, op, group)``, there are many more collectives currently implemented in
+PyTorch. Here are a few of the supported collectives.
 
 - ``dist.broadcast(tensor, src, group)``: Copies ``tensor`` from
   ``src`` to all other processes.
@@ -244,6 +253,12 @@ of 6 collectives currently implemented in PyTorch.
 - ``dist.all_gather(tensor_list, tensor, group)``: Copies ``tensor``
   from all processes to ``tensor_list``, on all processes.
 - ``dist.barrier(group)``: Blocks all processes in `group` until each one has entered this function.
+- ``dist.all_to_all(output_tensor_list, input_tensor_list, group)``: Scatters the list of input tensors to all processes in
+  a group and returns the gathered list of tensors in the output list.
+
+The full list of supported collectives is available in the latest
+`PyTorch Distributed documentation <https://pytorch.org/docs/stable/distributed.html>`__.
+
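As a small, hedged sketch of how these operators compose with ``all_reduce`` (reusing the ``run``/``init_process`` template from earlier; the values and group are illustrative):

.. code:: python

    import torch
    import torch.distributed as dist

    def run(rank, size):
        group = dist.new_group(list(range(size)))
        tensor = torch.tensor([rank + 1.0])
        # Element-wise sum across the group: with world_size == 2, both ranks end up with 3.0.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        print(f"Rank {rank} after SUM: {tensor[0]}")

        tensor = torch.tensor([float(rank)])
        # Element-wise maximum: both ranks end up with the largest contribution, 1.0.
        dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
        print(f"Rank {rank} after MAX: {tensor[0]}")

Note that backend support varies by operator (for example, ``PREMUL_SUM`` is only available with the NCCL backend), so check the ``ReduceOp`` documentation linked above for your backend.
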
 
 Distributed Training
 --------------------
@@ -275,7 +290,7 @@ gradients of their model on their batch of data and then average their
 gradients. In order to ensure similar convergence results when changing
 the number of processes, we will first have to partition our dataset.
 (You could also use
-`tnt.dataset.SplitDataset <https://github.com/pytorch/tnt/blob/master/torchnet/dataset/splitdataset.py#L4>`__,
+`torch.utils.data.random_split <https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split>`__,
 instead of the snippet below.)
 
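As a hedged illustration of that ``random_split`` alternative (the dataset, shard sizes, and seed are illustrative; the tutorial's own partitioning snippet follows right after):

.. code:: python

    import torch
    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    def shard_dataset(rank, size):
        dataset = datasets.MNIST('./data', train=True, download=True,
                                 transform=transforms.ToTensor())
        lengths = [len(dataset) // size] * size
        lengths[-1] += len(dataset) - sum(lengths)  # give any remainder to the last shard
        # The same seed on every rank yields identical splits, so each rank can pick its own shard.
        shards = random_split(dataset, lengths,
                              generator=torch.Generator().manual_seed(1234))
        return shards[rank]
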
 .. code:: python
@@ -389,7 +404,7 @@ could train any model on a large computer cluster.
 lot more tricks <https://seba-1511.github.io/dist_blog>`__ required to
 implement a production-level version of synchronous SGD. Again,
 use what `has been tested and
-optimized <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
+optimized <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.
 
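For reference, a minimal sketch of what the tested-and-optimized path looks like (assuming the process group has already been initialized as in the template above; the model is a placeholder):

.. code:: python

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    model = nn.Linear(10, 10)   # placeholder for any nn.Module
    ddp_model = DDP(model)      # on CPU/Gloo; pass device_ids=[local_rank] for one GPU per process
    # The training loop is unchanged: forward, loss.backward(), optimizer.step().
    # DDP overlaps gradient averaging with the backward pass for you.
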
 Our Own Ring-Allreduce
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -451,8 +466,9 @@ Communication Backends
 
 One of the most elegant aspects of ``torch.distributed`` is its ability
 to abstract and build on top of different backends. As mentioned before,
-there are currently three backends implemented in PyTorch: Gloo, NCCL, and
-MPI. They each have different specifications and tradeoffs, depending
+there are multiple backends implemented in PyTorch.
+The most commonly used are Gloo, NCCL, and MPI.
+Each has different specifications and tradeoffs, depending
 on the desired use case. A comparative table of supported functions can
 be found
 `here <https://pytorch.org/docs/stable/distributed.html#module-torch.distributed>`__.
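
Switching between them is, at the API level, mostly a one-argument change at initialization time; a hedged sketch (``rank`` and ``size`` come from the earlier template):

.. code:: python

    import torch.distributed as dist

    # Gloo: works out of the box and supports CPU tensors.
    dist.init_process_group(backend="gloo", rank=rank, world_size=size)

    # NCCL: shipped in the CUDA builds, intended for CUDA tensors.
    # dist.init_process_group(backend="nccl", rank=rank, world_size=size)

    # MPI: requires building PyTorch from source against an MPI installation
    # and is typically launched with mpirun, which supplies the rank and world size.
    # dist.init_process_group(backend="mpi")
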
@@ -544,15 +560,15 @@ NCCL backend is included in the pre-built binaries with CUDA support.
 Initialization Methods
 ~~~~~~~~~~~~~~~~~~~~~~
 
-To finish this tutorial, let's talk about the very first function we
-called: ``dist.init_process_group(backend, init_method)``. In
-particular, we will go over the different initialization methods which
-are responsible for the initial coordination step between each process.
-Those methods allow you to define how this coordination is done.
-Depending on your hardware setup, one of these methods should be
-naturally more suitable than the others. In addition to the following
-sections, you should also have a look at the `official
-documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__.
+To conclude this tutorial, let's examine the first function we invoked:
+``dist.init_process_group(backend, init_method)``. Specifically, we will discuss the various
+initialization methods responsible for the preliminary coordination step between each process.
+These methods enable you to define how this coordination is accomplished.
+
+The choice of initialization method depends on your hardware setup, and one method may be more
+suitable than others. In addition to the following sections, please refer to the `official
+documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__ for further information.
+
 
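To make the options concrete, here is a hedged sketch of passing an explicit ``init_method`` (``rank`` and ``size`` as in the template above; the address, port, and file path are placeholders):

.. code:: python

    import torch.distributed as dist

    # TCP rendezvous: every process connects to the address of rank 0.
    dist.init_process_group(backend="gloo",
                            init_method="tcp://10.1.1.20:23456",
                            rank=rank, world_size=size)

    # Shared file system rendezvous: all processes must see the same path.
    # dist.init_process_group(backend="gloo",
    #                         init_method="file:///mnt/nfs/sharedfile",
    #                         rank=rank, world_size=size)

    # Environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE):
    # this is also the default when init_method is omitted.
    # dist.init_process_group(backend="gloo", init_method="env://")
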
 **Environment Variable**
 