Skip to content

Commit ada3abf

Browse files
Merge pull request #247012 from sdgilley/sdg-distributed
add more info about environments
2 parents 35712d1 + 0f7588e commit ada3abf

File tree

1 file changed

+3
-1
lines changed

1 file changed

+3
-1
lines changed

articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/proj
4141
4242
To run distributed training using MPI, follow these steps:
4343

44-
1. Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides [curated environment](resource-curated-environments.md) for popular frameworks.
44+
1. Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides [curated environment](resource-curated-environments.md) for popular frameworks. Or [create a custom environment](how-to-manage-environments-v2.md#create-an-environment) with the preferred deep learning framework and MPI.
4545
1. Define a `command` with `instance_count`. `instance_count` should be equal to the number of GPUs per node for per-process-launch, or set to 1 (the default) for per-node-launch if the user script will be responsible for launching the processes per node.
4646
1. Use the `distribution` parameter of the `command` to specify settings for `MpiDistribution`.
4747

@@ -181,3 +181,5 @@ If you create an `AmlCompute` cluster of one of these RDMA-capable, InfiniBand-e
181181

182182
* [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
183183
* [Reference architecture for distributed deep learning training in Azure](/azure/architecture/reference-architectures/ai/training-deep-learning)
184+
* [Troubleshooting environment issues](how-to-troubleshoot-environments.md)
185+

0 commit comments

Comments
 (0)