Review the basic concepts of [distributed GPU training](concept-distributed-training.md).

> [!TIP]
> If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.
## MPI
Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/projects/mpi/) to launch a given number of processes on each node. Azure Machine Learning constructs the full MPI launch command (`mpirun`) behind the scenes. You can't provide your own full head-node-launcher command, such as `mpirun` or the DeepSpeed launcher.
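
For example, with the Azure Machine Learning Python SDK (v2), a minimal sketch of an MPI job might look like the following. The script path, environment, and compute names here are placeholders, not values from this article:

```python
from azure.ai.ml import command

# Minimal sketch of an MPI job (SDK v2). Azure Machine Learning generates
# the full `mpirun` launch command from this configuration; you only set
# the node count and the number of processes per node.
job = command(
    code="./src",  # assumed folder containing train.py
    command="python train.py",
    environment="<environment-with-an-mpi-library>",
    compute="<gpu-cluster>",
    instance_count=2,  # number of nodes
    distribution={"type": "mpi", "process_count_per_node": 4},
)
```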
> [!TIP]
> The base Docker image used by an Azure Machine Learning MPI job needs to have an MPI library installed. [Open MPI](https://www.open-mpi.org) is included in all the [Azure Machine Learning GPU base images](https://github.com/Azure/AzureML-Containers). When you use a custom Docker image, you are responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation such as Intel MPI. Azure Machine Learning also provides [curated environments](resource-curated-environments.md) for popular frameworks.
### Horovod
Use the MPI job configuration when you use [Horovod](https://horovod.readthedocs.io/en/stable/index.html) for distributed training with your deep learning framework.
Make sure your code follows these tips:
* The training code is instrumented correctly with Horovod before you add the Azure Machine Learning parts (see the sketch after this list).
* Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
* Create a `command` with your desired distribution.
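
A minimal sketch of the Horovod instrumentation itself, here with TensorFlow/Keras (the optimizer choice and learning rate are illustrative):

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

# Initialize Horovod before wiring in any Azure Machine Learning pieces.
hvd.init()

# Pin each process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all processes.
optimizer = hvd.DistributedOptimizer(
    tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
)

# Broadcast initial variable state from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```

For the `command` itself, reuse the MPI distribution shown earlier with an environment that contains Horovod and MPI.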
### Horovod example
* For the full notebook to run the Horovod example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb).
### Environment variables from Open MPI
When running MPI jobs with Open MPI images, you can use the following environment variables for each process launched:
1. `OMPI_COMM_WORLD_RANK`: The rank of the process
2. `OMPI_COMM_WORLD_SIZE`: The world size
3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
> [!TIP]
> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to the `NODE_RANK`. To use per-node-launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
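
As a minimal sketch, a training script might read these variables as follows. Exporting them under the `MASTER_ADDR`/`MASTER_PORT` names is an illustration of a common convention, not something the job does for you:

```python
import os

# Identity of this process, as set by Open MPI.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])              # global rank
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])        # total process count
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank on this node

# AZ_BATCH_MASTER_NODE has the form "MASTER_ADDR:MASTER_PORT"
# (assumed to be set on multi-node jobs).
master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = master_port

# Per-node launcher: with process_count_per_node=1, the global rank
# doubles as the node rank.
node_rank = rank
```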
## PyTorch
Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (`torch.distributed`).
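
As a minimal sketch, assuming `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are set in each process's environment, the training script initializes the default process group before making any collective calls:

```python
import torch.distributed as dist

# NCCL is the usual backend for GPU training; "env://" reads RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
dist.init_process_group(backend="nccl", init_method="env://")
```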