articles/synapse-analytics/machine-learning/tutorial-horovod-tensorflow.md
Lines changed: 8 additions & 8 deletions
@@ -1,6 +1,6 @@
 ---
-title: 'Tutorial: Distributed training with Horovod and Tensorflow'
-description: Tutorial on how to run distributed training with the Horovod Runner and Tensorflow
+title: 'Tutorial: Distributed training with Horovod and TensorFlow'
+description: Tutorial on how to run distributed training with the Horovod Runner and TensorFlow
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: tutorial
@@ -9,11 +9,11 @@ author: midesa
 ms.author: midesa
 ---
 
-# Tutorial: Distributed Training with Horovod Runner and Tensorflow (Preview)
+# Tutorial: Distributed Training with Horovod Runner and TensorFlow (Preview)
 
 [Horovod](https://github.com/horovod/horovod) is a distributed training framework for libraries like TensorFlow and PyTorch. With Horovod, users can scale up an existing training script to run on hundreds of GPUs in just a few lines of code.
 
-Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime.For Spark ML pipeline applications using Tensorflow, users can use ```HorovodRunner```. This notebook uses an Apache Spark dataframe to perform distributed training of a distributed neural network (DNN) model on MNIST dataset. This tutorial leverages Tensorflow and the ```HorovodRunner``` to run the training process.
+Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime.For Spark ML pipeline applications using TensorFlow, users can use ```HorovodRunner```. This notebook uses an Apache Spark dataframe to perform distributed training of a distributed neural network (DNN) model on MNIST dataset. This tutorial leverages TensorFlow and the ```HorovodRunner``` to run the training process.
 
 ## Prerequisites
 
@@ -22,7 +22,7 @@ Within Azure Synapse Analytics, users can quickly get started with Horovod using
 
 ## Configure the Apache Spark session
 
-At the start of the session, we will need to configure a few Apache Spark settings. In most cases, we only needs to set the ```numExecutors``` and ```spark.rapids.memory.gpu.reserve```. For very large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For Tensorflow models, users will need to set the ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to be true.
+At the start of the session, we will need to configure a few Apache Spark settings. In most cases, we only needs to set the ```numExecutors``` and ```spark.rapids.memory.gpu.reserve```. For very large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For TensorFlow models, users will need to set the ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to be true.
 
 In the example below, you can see how the Spark configurations can be passed with the ```%%configure``` command. The detailed meaning of each parameter is explained in the [Apache Spark configuration documentation](https://spark.apache.org/docs/latest/configuration.html). The values provided below are the suggested, best practice values for Azure Synapse GPU-large pools.

-Once we have finished processing our dataset, we can now define our Tensorflow model. The same code could also be used to train a single-node Tensorflow model.
+Once we have finished processing our dataset, we can now define our TensorFlow model. The same code could also be used to train a single-node TensorFlow model.
 
 ```python
 # Define the TensorFlow model without any Horovod-specific parameters
@@ -153,7 +153,7 @@ def get_model():
 
 ## Define a training function for a single node
 
-First, we will train our Tensorflow model on the driver node of the Apache Spark pool. Once we have finished the training process, we will evaluate the model and print the loss and accuracy scores.
+First, we will train our TensorFlow model on the driver node of the Apache Spark pool. Once we have finished the training process, we will evaluate the model and print the loss and accuracy scores.
 
 ```python
 
@@ -346,4 +346,4 @@ To ensure the Spark instance is shut down, end any connected sessions(notebooks)
 ## Next steps
 
 *[Check out Synapse sample notebooks](https://github.com/Azure-Samples/Synapse/tree/main/MachineLearning)
-*[Learn more about GPU-enabled Apache Spark pools](../spark/apache-spark-gpu-concept.md)
+*[Learn more about GPU-enabled Apache Spark pools](../spark/apache-spark-gpu-concept.md)
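
This change corrects the product name's capitalization from "Tensorflow" to "TensorFlow" throughout the article. The following is a minimal sketch, not part of this change, of how one might check a Markdown file for leftover occurrences. The helper name `find_misspellings` and the sample text are hypothetical and serve only as illustration; the pattern is case-sensitive, so it leaves lowercase `tensorflow` in package names and file paths alone.

```python
import re

# Matches the misspelling "Tensorflow" (lowercase f) as a whole word.
# The search is case-sensitive, so the correct "TensorFlow" and the
# lowercase "tensorflow" used in file paths are not reported.
MISSPELLING = re.compile(r"\bTensorflow\b")


def find_misspellings(text):
    """Return (line_number, line) pairs that still contain 'Tensorflow'.

    Hypothetical helper for illustration; line numbers start at 1.
    """
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if MISSPELLING.search(line)
    ]


# Sample input: the old title, the corrected title, and the file path.
sample = "\n".join([
    "title: 'Tutorial: Distributed training with Horovod and Tensorflow'",
    "title: 'Tutorial: Distributed training with Horovod and TensorFlow'",
    "articles/synapse-analytics/machine-learning/tutorial-horovod-tensorflow.md",
])

hits = find_misspellings(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

Only the first sample line is reported, because the second uses the corrected spelling and the third contains only the lowercase form.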