Commit cfae937

remove MPI & Horovod per Jeff S.
1 parent c8ccb3c commit cfae937

File tree

1 file changed: +0 −37 lines changed

articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 0 additions & 37 deletions
@@ -32,43 +32,6 @@ Review the basic concepts of [distributed GPU training](concept-distributed-training.md)
 > [!TIP]
 > If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.
 
-## MPI
-
-Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/projects/mpi/) to launch a given number of processes in each node. Azure Machine Learning constructs the full MPI launch command (`mpirun`) behind the scenes. You can't provide your own full head-node-launcher commands like `mpirun` or the DeepSpeed launcher.
-
-> [!TIP]
-> The base Docker image used by an Azure Machine Learning MPI job needs to have an MPI library installed. [Open MPI](https://www.open-mpi.org) is included in all the [Azure Machine Learning GPU base images](https://github.com/Azure/AzureML-Containers). When you use a custom Docker image, you're responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation, such as Intel MPI. Azure Machine Learning also provides [curated environments](resource-curated-environments.md) for popular frameworks.
-
-### Horovod
-
-Use the MPI job configuration when you use [Horovod](https://horovod.readthedocs.io/en/stable/index.html) for distributed training with the deep learning framework.
-
-Make sure your code follows these tips:
-
-* The training code is instrumented correctly with Horovod before adding the Azure Machine Learning parts.
-* Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
-* Create a `command` with your desired distribution.
-
-### Horovod example
-
-* For the full notebook to run the Horovod example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb).
-
-### Environment variables from Open MPI
-
-When running MPI jobs with Open MPI images, you can use the following environment variables for each process launched:
-
-1. `OMPI_COMM_WORLD_RANK`: The rank of the process
-2. `OMPI_COMM_WORLD_SIZE`: The world size
-3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
-4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
-5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
-
-> [!TIP]
-> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to `NODE_RANK`. To use a per-node launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
 
 ## PyTorch
 
 Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (`torch.distributed`).
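For reference, the removed MPI section corresponds to submitting a `command` job with an MPI distribution in the Azure Machine Learning Python SDK (v2). A minimal sketch, not taken from the docs page; the script path, compute cluster name, and curated environment name are placeholder assumptions:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Azure Machine Learning constructs the full mpirun command behind the
# scenes; the job only declares how many processes to launch per node.
job = command(
    code="./src",  # hypothetical folder containing train.py
    command="python train.py",
    environment="AzureML-tensorflow-2.12-cuda11@latest",  # assumed curated environment name
    compute="gpu-cluster",  # hypothetical cluster name
    instance_count=2,  # number of nodes
    distribution={"type": "mpi", "process_count_per_instance": 2},
)

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
ml_client.jobs.create_or_update(job)
```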
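The removed Horovod tip, "the training code is instrumented correctly with Horovod before adding the Azure Machine Learning parts," refers to Horovod's standard instrumentation pattern. A minimal TensorFlow/Keras sketch of that pattern, not taken from the linked notebook:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched by MPI

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by world size and wrap the optimizer so
# gradients are averaged across processes with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=opt,
)

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```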
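The Open MPI environment variables listed in the removed section are commonly remapped to the names `torch.distributed` expects. A minimal sketch of that remapping, following the `MASTER_ADDR:MASTER_PORT` format described in the removed list item:

```python
import os

# Open MPI exposes per-process rank information; remap it to the
# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT convention.
os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]

# AZ_BATCH_MASTER_NODE is "<address>:<port>" (assumed multi-node job).
master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = master_port
```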
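The `## PyTorch` section that survives the commit opens with `torch.distributed`. A minimal sketch of the native initialization it refers to, assuming the environment variables above are already set:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# init_method="env://" reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; wrap it so gradients are synchronized across ranks.
model = torch.nn.Linear(10, 10).cuda(local_rank)
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
```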
