---
author: nibaccam
ms.author: nibaccam
ms.subservice: core
ms.topic: conceptual
ms.date: 03/24/2020
---

# Distributed training with Azure Machine Learning

## What is distributed training?

Distributed training refers to the ability to share the data load and training tasks across multiple GPUs in order to accelerate model training. The typical use case for distributed training is training [deep learning](concept-deep-learning-vs-machine-learning.md) models.

Deep neural networks are often compute intensive, as training them involves processing millions of examples and parameters across multiple layers. Deep learning lends itself well to distributed training, since running tasks in parallel instead of serially saves time and compute resources.

## Distributed training in Azure Machine Learning
Azure Machine Learning supports distributed training via integrations with popular deep learning frameworks, PyTorch and TensorFlow. Both PyTorch and TensorFlow employ [data parallelism](#data-parallelism) for distributed training, and leverage [Horovod](https://horovod.readthedocs.io/en/latest/summary_include.html) for optimizing compute speeds.

* [Distributed training with PyTorch](how-to-train-pytorch.md#distributed-training)
* [Distributed training with TensorFlow](how-to-train-tensorflow.md#distributed-training)
For training traditional ML models, see [Azure Machine Learning SDK for Python](#python-sdk) for the different ways to train models using the Python SDK.
## Types of distributed training
There are two main types of distributed training: **data parallelism** and **model parallelism**.
### Data parallelism
In data parallelism, the data is divided into partitions, where the number of partitions is equal to the total number of available nodes in the compute cluster. The model is copied to each of these worker nodes, and each worker operates on its own subset of the data. Keep in mind that the model has to entirely fit on each node; that is, each node has to have the capacity to support the model that's being trained.
Each node independently computes the errors between its predictions for its training samples and the labeled outputs. In turn, each node must communicate all of its changes to the other nodes to update all of the model copies. This means that the worker nodes need to synchronize the model parameters, or gradients, at the end of each batch computation to ensure they are training a consistent model.
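
The partition-compute-synchronize cycle above can be sketched without any framework. This is a minimal, illustrative simulation in plain NumPy, not Azure Machine Learning or Horovod code; the `local_gradient` helper and the linear model are hypothetical stand-ins for a real model's backward pass:

```python
import numpy as np

def local_gradient(w, X, y):
    # Hypothetical per-worker step: mean-squared-error gradient of a
    # linear model, computed on one worker's shard of the data.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # toy dataset
y = rng.normal(size=8)
w = np.zeros(3)               # model weights, replicated on every worker

n_workers = 4
shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))

# Each worker computes a gradient on its own partition of the data...
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]

# ...then the workers synchronize by averaging the gradients (an all-reduce),
# so every replica of the model applies the identical update.
avg_grad = np.mean(grads, axis=0)
w -= 0.1 * avg_grad
```

With equal-sized shards, the averaged gradient matches the gradient computed over the full dataset, which is why every replica stays consistent; in a real cluster, frameworks such as Horovod perform this all-reduce step across machines.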
### Model parallelism
In model parallelism, also known as network parallelism, the model is segmented into different parts that can run concurrently, and each part runs on the same data in a different node. The scalability of this method depends on the degree of task parallelization of the algorithm, and it is more complex to implement than data parallelism.
In model parallelism, worker nodes only need to synchronize the shared parameters, usually once for each forward or backward propagation step. Also, larger models aren't a concern since each node operates on a subsection of the model on the same training data.
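
To make the contrast with data parallelism concrete, here is a minimal sketch of the same idea in plain NumPy. The two "nodes" are just functions in one process, a hypothetical simplification; in a real deployment each segment's parameters would live on a separate machine and only the boundary activations would cross the network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer network: each layer's parameters live on a different node,
# so neither node ever has to hold the full model.
W1 = rng.normal(size=(4, 5))   # segment held by node 0
W2 = rng.normal(size=(5, 2))   # segment held by node 1

def node0_forward(x):
    # Node 0 runs its segment (linear layer + ReLU) and communicates
    # only the resulting activation to the next node.
    return np.maximum(x @ W1, 0)

def node1_forward(h):
    # Node 1 receives the boundary activation and finishes the forward pass.
    return h @ W2

x = rng.normal(size=(3, 4))    # the same training batch flows through both nodes
out = node1_forward(node0_forward(x))
```

Note that the data is not partitioned here: the whole batch passes through every segment, and what is communicated between nodes is the activation at the segment boundary rather than gradients for a full model replica.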
## Next steps
* Learn how to [Set up training environments](how-to-set-up-training-targets.md).
* [Train ML models with TensorFlow](how-to-train-tensorflow.md)
* [Train ML models with PyTorch](how-to-train-pytorch.md)