Commit 153063a

Merge pull request #269974 from cdpark/group3-distributed-sdgilley
User Story 233117: Q&M: March AzureML Freshness updates - Distributed training
2 parents f246751 + a2322cc commit 153063a

File tree

1 file changed (+14 −13 lines changed)


articles/machine-learning/concept-distributed-training.md

Lines changed: 14 additions & 13 deletions
@@ -10,43 +10,44 @@ ms.reviewer: sgilley
ms.subservice: training
ms.custom: build-2023
ms.topic: conceptual
- ms.date: 03/27/2020
+ ms.date: 03/22/2024
---

# Distributed training with Azure Machine Learning

In this article, you learn about distributed training and how Azure Machine Learning supports it for deep learning models.

- In distributed training the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training. Distributed training can be used for traditional ML models, but is better suited for compute and time intensive tasks, like [deep learning](concept-deep-learning-vs-machine-learning.md) for training deep neural networks.
+ In distributed training, the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training. Distributed training can be used for traditional machine learning models, but is better suited for compute and time intensive tasks, like [deep learning](concept-deep-learning-vs-machine-learning.md) for training deep neural networks.

- ## Deep learning and distributed training
-
- There are two main types of distributed training: [data parallelism](#data-parallelism) and [model parallelism](#model-parallelism). For distributed training on deep learning models, the [Azure Machine Learning SDK in Python](/python/api/overview/azure/ml/intro) supports integrations with popular frameworks, PyTorch and TensorFlow. Both frameworks employ data parallelism for distributed training, and can leverage [horovod](https://horovod.readthedocs.io/en/latest/summary_include.html) for optimizing compute speeds.
+ ## Deep learning and distributed training

+ There are two main types of distributed training: [data parallelism](#data-parallelism) and [model parallelism](#model-parallelism). For distributed training on deep learning models, the [Azure Machine Learning SDK in Python](/python/api/overview/azure/ml/intro) supports integrations with PyTorch and TensorFlow. Both are popular frameworks that employ data parallelism for distributed training, and can use [Horovod](https://horovod.readthedocs.io/en/latest/summary_include.html) to optimize compute speeds.

* [Distributed training with PyTorch](how-to-train-distributed-gpu.md#pytorch)

* [Distributed training with TensorFlow](how-to-train-distributed-gpu.md#tensorflow)
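As a minimal sketch of what the paragraph above describes (not part of the committed article), this is roughly how a distributed PyTorch job can be submitted with the Azure Machine Learning Python SDK v2 (`azure-ai-ml`); the workspace values, compute cluster name, environment reference, and `train.py` script are placeholders assumed for illustration.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to an existing workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Define a command job that runs the training script across multiple worker nodes.
job = command(
    code="./src",                              # folder assumed to contain train.py
    command="python train.py --epochs 10",
    environment="<environment-name>@latest",   # placeholder curated or custom environment
    compute="gpu-cluster",                     # assumed name of an existing GPU cluster
    instance_count=2,                          # number of worker nodes
    distribution={
        "type": "PyTorch",                     # service sets up the PyTorch process group
        "process_count_per_instance": 4,       # for example, one process per GPU
    },
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.name)
```

With `distribution` set, the service launches `process_count_per_instance` worker processes on each of the `instance_count` nodes and wires up the environment variables the framework needs to form a process group.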

- For ML models that don't require distributed training, see [train models with Azure Machine Learning](concept-train-machine-learning-model.md#python-sdk) for the different ways to train models using the Python SDK.
+ For machine learning models that don't require distributed training, see [Train models with Azure Machine Learning](concept-train-machine-learning-model.md#python-sdk) for different ways to train models using the Python SDK.

## Data parallelism

Data parallelism is the easiest to implement of the two distributed training approaches, and is sufficient for most use cases.

- In this approach, the data is divided into partitions, where the number of partitions is equal to the total number of available nodes, in the compute cluster or [serverless compute](./how-to-use-serverless-compute.md). The model is copied in each of these worker nodes, and each worker operates on its own subset of the data. Keep in mind that each node has to have the capacity to support the model that's being trained, that is the model has to entirely fit on each node. The following diagram provides a visual demonstration of this approach.
+ In this approach, the data is divided into partitions, where the number of partitions is equal to the total number of available nodes, in the compute cluster or [serverless compute](./how-to-use-serverless-compute.md). The model is copied in each of these worker nodes, and each node operates on its own subset of the data. Keep in mind that each node must have the capacity to support the model that's being trained, that is, the entire model has to fit on each node.
+
+ The following diagram shows this approach.

- ![Data-parallelism-concept-diagram](./media/concept-distributed-training/distributed-training.svg)
+ :::image type="content" source="media/concept-distributed-training/distributed-training.svg" alt-text="Diagram of data parallelism showing the model copied into worker nodes.":::
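As a rough illustration of the partitioning described above (not from the article), the following PyTorch sketch gives each worker its own slice of the data while replicating the full model on every node; the dataset, model, and backend choice are arbitrary placeholders.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Launched with a tool such as torchrun, each process is one worker (rank).
dist.init_process_group(backend="gloo")   # typically "nccl" on GPU nodes
rank = dist.get_rank()
world_size = dist.get_world_size()

# Dummy dataset standing in for real training data.
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

# Each rank draws a disjoint partition: roughly 1024 / world_size samples.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# The full model is copied onto every worker, so it must fit on each node.
model = torch.nn.Linear(10, 1)
```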

- Each node independently computes the errors between its predictions for its training samples and the labeled outputs. In turn, each node updates its model based on the errors and must communicate all of its changes to the other nodes to update their corresponding models. This means that the worker nodes need to synchronize the model parameters, or gradients, at the end of the batch computation to ensure they are training a consistent model.
+ Each node independently computes the errors between its predictions for its training samples and the labeled outputs. In turn, each node updates its model based on the errors and must communicate all of its changes to the other nodes to update their corresponding models. Worker nodes need to synchronize the model parameters, or gradients, at the end of the batch computation to ensure they're training a consistent model.
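Continuing the sketch above, this is roughly what that synchronization looks like: after each backward pass, the workers average their gradients (an all-reduce) before stepping the optimizer. In practice `torch.nn.parallel.DistributedDataParallel` performs this automatically; the manual loop here is only to show the idea.

```python
# Synchronize gradients after each backward pass so every replica applies the
# same update and the copies of the model stay consistent.
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for features, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()

    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across workers
        param.grad /= world_size                           # then average them

    optimizer.step()
```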

## Model parallelism

- In model parallelism, also known as network parallelism, the model is segmented into different parts that can run concurrently in different nodes, and each one will run on the same data. The scalability of this method depends on the degree of task parallelization of the algorithm, and it is more complex to implement than data parallelism.
+ In model parallelism, also known as network parallelism, the model is segmented into different parts that can run concurrently in different nodes, and each one runs on the same data. The scalability of this method depends on the degree of task parallelization of the algorithm, and it's more complex to implement than data parallelism.

In model parallelism, worker nodes only need to synchronize the shared parameters, usually once for each forward or backward-propagation step. Also, larger models aren't a concern since each node operates on a subsection of the model on the same training data.
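As a toy illustration only (not from the article), the following PyTorch sketch splits a model's layers across two devices, `cuda:0` and `cuda:1`, which stands in for splitting a model across worker nodes; it assumes a machine with two GPUs.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """A model split into two stages that live on different devices."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(10, 64), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(64, 1).to("cuda:1")

    def forward(self, x):
        hidden = self.stage1(x.to("cuda:0"))
        return self.stage2(hidden.to("cuda:1"))   # hand activations to the next stage

model = TwoStageModel()
output = model(torch.randn(32, 10))               # both stages see the same batch
output.sum().backward()                           # autograd crosses the device boundary
```

Splitting across separate processes or nodes follows the same idea, with explicit communication of activations and gradients at the stage boundary.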

- ## Next steps
+ ## Related content

- * For a technical example, see the [reference architecture scenario](/azure/architecture/reference-architectures/ai/training-deep-learning).
- * Find tips for MPI, TensorFlow, and PyTorch in the [Distributed GPU training guide](how-to-train-distributed-gpu.md)
+ * [Artificial intelligence (AI) architecture design](/azure/architecture/reference-architectures/ai/training-deep-learning)
+ * [Distributed GPU training guide](how-to-train-distributed-gpu.md)
