
Commit ed9c46a

Merge branch 'main' into dcp_async_save
2 parents 6acfa55 + 09d6583

File tree

2 files changed: +23 -3 lines changed

.jenkins/build.sh

Lines changed: 3 additions & 3 deletions

@@ -21,9 +21,9 @@ sudo apt-get install -y pandoc
 
 #Install PyTorch Nightly for test.
 # Nightly - pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
-# Install 2.2 for testing - uncomment to install nightly binaries (update the version as needed).
-# pip uninstall -y torch torchvision torchaudio torchtext torchdata
-# pip3 install torch==2.3.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu121
+# Install 2.4 to merge all 2.4 PRs - uncomment to install nightly binaries (update the version as needed).
+pip uninstall -y torch torchvision torchaudio torchtext torchdata
+pip3 install torch==2.4.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124
 
 # Install two language tokenizers for Translation with TorchText tutorial
 python -m spacy download en_core_web_sm

recipes_source/distributed_device_mesh.rst

Lines changed: 20 additions & 0 deletions

@@ -148,6 +148,26 @@ Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable
 
     torchrun --nproc_per_node=8 hsdp.py
 
+How to use DeviceMesh for your custom parallel solutions
+--------------------------------------------------------
+When working with large scale training, you might have a more complex custom parallel training composition. For example, you may need to slice out submeshes for different parallelism solutions.
+DeviceMesh allows users to slice child meshes from the parent mesh and reuse the NCCL communicators already created when the parent mesh was initialized.
+
+.. code-block:: python
+
+    from torch.distributed.device_mesh import init_device_mesh
+    mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("replicate", "shard", "tp"))
+
+    # Users can slice child meshes from the parent mesh.
+    hsdp_mesh = mesh_3d["replicate", "shard"]
+    tp_mesh = mesh_3d["tp"]
+
+    # Users can access the underlying process group through the `get_group` API.
+    replicate_group = hsdp_mesh["replicate"].get_group()
+    shard_group = hsdp_mesh["shard"].get_group()
+    tp_group = tp_mesh.get_group()
+
+
 Conclusion
 ----------
 In conclusion, we have learned about :class:`DeviceMesh` and :func:`init_device_mesh`, as well as how
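
The sliced submeshes above are meant to feed existing parallelism APIs. Below is a minimal sketch, not part of this commit, of how the 2D hsdp_mesh might drive HSDP (mirroring the hsdp.py example earlier in the recipe) and how a group returned by `get_group` can be passed to a plain c10d collective; the Linear model and the tensor are illustrative stand-ins.

.. code-block:: python

    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("replicate", "shard", "tp"))
    hsdp_mesh = mesh_3d["replicate", "shard"]
    tp_mesh = mesh_3d["tp"]

    # The 2D ("replicate", "shard") submesh can be handed to FSDP to get
    # hybrid sharding, as in the hsdp.py example; the model is a toy stand-in.
    model = FSDP(
        torch.nn.Linear(10, 10).cuda(),
        device_mesh=hsdp_mesh,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )

    # The ProcessGroup returned by `get_group` works with any c10d collective,
    # here an all-reduce that spans only the "tp" dimension of the mesh.
    t = torch.ones(4, device="cuda")
    dist.all_reduce(t, group=tp_mesh.get_group())

Because the submeshes reuse the communicators created for the parent mesh, slicing adds no extra NCCL setup cost.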
