Review the basic concepts of [distributed GPU training](concept-distributed-training.md).

> [!TIP]
> If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.
## MPI
Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/projects/mpi/) to launch a given number of processes on each node. Azure Machine Learning constructs the full MPI launch command (`mpirun`) behind the scenes. You can't provide your own full head-node-launcher command, such as `mpirun` or the DeepSpeed launcher.
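
For example, with the Azure Machine Learning Python SDK (v2), a minimal sketch of an MPI job might look like the following. The script path, environment, and compute names here are placeholders, not values from this article:

```python
from azure.ai.ml import command

# Minimal sketch of an MPI job (SDK v2). Azure Machine Learning generates
# the full `mpirun` launch command from this configuration; you only set
# the node count and the number of processes per node.
job = command(
    code="./src",  # assumed folder containing train.py
    command="python train.py",
    environment="<environment-with-an-mpi-library>",
    compute="<gpu-cluster>",
    instance_count=2,  # number of nodes
    distribution={"type": "mpi", "process_count_per_node": 4},
)
```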
> [!TIP]
> The base Docker image used by an Azure Machine Learning MPI job needs to have an MPI library installed. [Open MPI](https://www.open-mpi.org) is included in all the [Azure Machine Learning GPU base images](https://github.com/Azure/AzureML-Containers). When you use a custom Docker image, you are responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation such as Intel MPI. Azure Machine Learning also provides [curated environments](resource-curated-environments.md) for popular frameworks.
### Horovod
Use the MPI job configuration when you use [Horovod](https://horovod.readthedocs.io/en/stable/index.html) for distributed training with your deep learning framework.
Make sure your code follows these tips:
* The training code is instrumented correctly with Horovod before you add the Azure Machine Learning parts (see the sketch after this list).
* Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
* Create a `command` with your desired distribution.
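
A minimal sketch of the Horovod instrumentation itself, here with TensorFlow/Keras (the optimizer choice and learning rate are illustrative):

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

# Initialize Horovod before wiring in any Azure Machine Learning pieces.
hvd.init()

# Pin each process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all processes.
optimizer = hvd.DistributedOptimizer(
    tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
)

# Broadcast initial variable state from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```

For the `command` itself, reuse the MPI distribution shown earlier with an environment that contains Horovod and MPI.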
### Horovod example
* For the full notebook to run the Horovod example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb).
### Environment variables from Open MPI
When running MPI jobs with Open MPI images, you can use the following environment variables for each process launched:
1. `OMPI_COMM_WORLD_RANK`: The rank of the process
2. `OMPI_COMM_WORLD_SIZE`: The world size
3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
> [!TIP]
> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to the `NODE_RANK`. To use per-node-launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
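
As a minimal sketch, a training script might read these variables as follows. Exporting them under the `MASTER_ADDR`/`MASTER_PORT` names is an illustration of a common convention, not something the job does for you:

```python
import os

# Identity of this process, as set by Open MPI.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])              # global rank
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])        # total process count
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank on this node

# AZ_BATCH_MASTER_NODE has the form "MASTER_ADDR:MASTER_PORT"
# (assumed to be set on multi-node jobs).
master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = master_port

# Per-node launcher: with process_count_per_node=1, the global rank
# doubles as the node rank.
node_rank = rank
```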
## PyTorch
Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (`torch.distributed`).
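
As a minimal sketch, assuming `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are set in each process's environment, the training script initializes the default process group before making any collective calls:

```python
import torch.distributed as dist

# NCCL is the usual backend for GPU training; "env://" reads RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
dist.init_process_group(backend="nccl", init_method="env://")
```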