
Commit 2391115

Intro rework
1 parent df5b054 commit 2391115

2 files changed: +13 -19 lines changed

articles/machine-learning/concept-distributed-training.md

Lines changed: 12 additions & 18 deletions
@@ -11,44 +11,38 @@ ms.topic: conceptual
ms.date: 03/27/2020
---

-# What is distributed training?
+# Distributed training with Azure Machine Learning

-Distributed training refers to the ability to share and parallelize data loads and training tasks across multiple GPUs to accelerate model training. The typical use case for distributed training is for training deep neural networks and [deep learning](concept-deep-learning-vs-machine-learning.md) models.
+In distributed training, the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training.

-Deep neural networks are often compute intensive, as they require large learning workloads in order to process millions of examples and parameters across multiple layers. This deep learning lends itself well to distributed training, since running tasks in parallel, instead of serially, saves time and compute resources.
+This type of training is well suited for compute- and time-intensive tasks, like training deep neural networks and [deep learning](concept-deep-learning-vs-machine-learning.md) models.
+
+There are two main types of distributed training: [data parallelism](#data-parallelism) and [model parallelism](#model-parallelism). Azure Machine Learning currently only supports integrations with frameworks that can perform data parallelism.

## Distributed training in Azure Machine Learning

-Azure Machine Learning supports distributed training via integrations with popular deep learning frameworks, PyTorch and TensorFlow. Both PyTorch and TensorFlow employ [data parallelism](#data-parallelism) for distributed training, and leverage [Horovod](https://horovod.readthedocs.io/en/latest/summary_include.html) for optimizing compute speeds.
+Azure Machine Learning is integrated with the popular deep learning frameworks PyTorch and TensorFlow. Both frameworks employ data parallelism for distributed training, and leverage [Horovod](https://horovod.readthedocs.io/en/latest/summary_include.html) for optimizing compute speeds.

-* [Distributed training with PyTorch](how-to-train-tensorflow.md#distributed-training)
+* [Distributed training with PyTorch in the Python SDK](how-to-train-pytorch.md#distributed-training)

-* [Distributed training with TensorFlow](how-to-train-pytorch.md#distributed-training)
+* [Distributed training with TensorFlow in the Python SDK](how-to-train-tensorflow.md#distributed-training)

For training traditional ML models, see [Azure Machine Learning SDK for Python](concept-train-machine-learning-model.md#python-sdk) for the different ways to train models using the Python SDK.
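
As a rough, hypothetical illustration of the PyTorch integration described above, a Horovod-backed distributed run could be submitted with the Azure ML Python SDK v1 estimator API along the following lines; the workspace config, the `gpu-cluster` compute target name, the `./src` folder, and `train.py` are placeholders, not values from the docs.

```python
# Hypothetical sketch: submit a Horovod-based distributed PyTorch job with the
# Azure ML Python SDK v1 estimator API (assumed; names below are placeholders).
from azureml.core import Experiment, Workspace
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import PyTorch

ws = Workspace.from_config()                          # assumes a local config.json
compute_target = ws.compute_targets["gpu-cluster"]    # hypothetical GPU cluster

estimator = PyTorch(
    source_directory="./src",          # hypothetical folder containing train.py
    entry_script="train.py",
    compute_target=compute_target,
    node_count=2,                      # two worker nodes
    process_count_per_node=1,          # one Horovod process per node
    distributed_training=MpiConfiguration(),
    use_gpu=True,
)

run = Experiment(ws, "distributed-pytorch").submit(estimator)
run.wait_for_completion(show_output=True)
```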

-## Types of distributed training
-
-There are two main types of distributed training: **data parallelism** and **model parallelism**.
-
-### Data parallelism
+## Data parallelism

In data parallelism, the data is divided into partitions, where the number of partitions is equal to the total number of available nodes in the compute cluster. The model is copied to each of these worker nodes, and each worker operates on its own subset of the data. Keep in mind that each node has to have the capacity to support the model that's being trained; that is, the model has to fit entirely on each node.

Each node independently computes the errors between its predictions for its training samples and the labeled outputs. In turn, each node updates its model based on the errors and must communicate all of its changes to the other nodes to update their corresponding models. This means that the worker nodes need to synchronize the model parameters, or gradients, at the end of the batch computation to ensure they are training a consistent model.
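
For a concrete, minimal sketch of this data-parallel pattern (assuming Horovod with PyTorch, the combination referenced above), each worker trains a full copy of a placeholder model on its own data shard, and gradients are averaged across workers at every step; the model, dataset, and hyperparameters below are illustrative only.

```python
# Hypothetical sketch of data parallelism with Horovod + PyTorch:
# each worker holds a full model copy and its own data partition;
# gradients are allreduce-averaged across workers at each step.
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per worker
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(20, 2).cuda()        # placeholder model (fits on every node)
dataset = torch.utils.data.TensorDataset(
    torch.randn(1000, 20), torch.randint(0, 2, (1000,)))

# Partition the data: each worker sees only its own shard.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
# Start every worker from the same initial weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = torch.nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x.cuda()), y.cuda())
    loss.backward()                          # gradients synchronized here
    optimizer.step()
```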

-### Model parallelism
+## Model parallelism

In model parallelism, also known as network parallelism, the model is segmented into different parts that can run concurrently on different nodes, and each part runs on the same data. The scalability of this method depends on the degree of task parallelization of the algorithm, and it is more complex to implement than data parallelism.

In model parallelism, worker nodes only need to synchronize the shared parameters, usually once for each forward or backward-propagation step. Also, larger models aren't a concern since each node operates on a subsection of the model on the same training data.
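
To make the contrast with data parallelism concrete, here is a minimal, hypothetical PyTorch sketch of model parallelism on one machine with two GPUs: the two halves of the network live on different devices, the same batch flows through both, and only the activations passed between the parts move across devices. The layer sizes and device names are placeholders.

```python
# Hypothetical sketch of model parallelism in PyTorch: the model is split into
# two parts placed on different devices, and the same input flows through both.
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(20, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(64, 2).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))       # first half runs on GPU 0
        return self.part2(h.to("cuda:1"))    # activations move to GPU 1

model = TwoDeviceNet()
x = torch.randn(32, 20)                      # same batch feeds both parts
out = model(x)                               # output lives on cuda:1
print(out.device)
```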

## Next steps

-* Learn how to [Set up training environments](how-to-set-up-training-targets.md).
-
+* Learn how to [set up training environments](how-to-set-up-training-targets.md) with the Python SDK.
* [Train ML models with TensorFlow](how-to-train-tensorflow.md).
-
-* [Train ML models with PyTorch](how-to-train-pytorch.md).
-
-
+* [Train ML models with PyTorch](how-to-train-pytorch.md).

articles/machine-learning/toc.yml

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@
displayName: run config, estimator, machine learning pipeline, ml pipeline, train model
href: concept-train-machine-learning-model.md
- name: Distributed training
-displayName: paralellism, deep learning, deep neural network, dnn
+displayName: parallelization, deep learning, deep neural network, dnn
href: concept-distributed-training.md
- name: Model management (MLOps)
displayName: deploy, deployment, publish, production, operationalize, operationalization
