Commit b3ad796

Merge pull request #267483 from midesa/main
gpu warning messages and aml tutorial
2 parents ec310b9 + 14dba5f commit b3ad796

10 files changed: +165 -37 lines changed
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
---
title: Access ADLSg2 data from Azure Machine Learning
description: This article provides an overview on how you can access data in your Azure Data Lake Storage Gen 2 (ADLSg2) account directly from Azure Machine Learning.
author: midesa
ms.service: synapse-analytics
ms.topic: tutorial
ms.subservice: machine-learning
ms.date: 02/27/2024
ms.author: midesa
---

# Tutorial: Accessing Azure Synapse ADLS Gen2 Data in Azure Machine Learning
In this tutorial, we'll guide you through the process of accessing data stored in Azure Synapse Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Machine Learning. This capability is especially valuable when you want to streamline your machine learning workflow by using tools such as Automated ML, integrated model and experiment tracking, or specialized hardware like GPUs available in Azure Machine Learning.

To access ADLS Gen2 data in Azure Machine Learning, we'll create an Azure Machine Learning Datastore that points to the Azure Synapse ADLS Gen2 storage account.

## Prerequisites

- An [Azure Synapse Analytics workspace](../get-started-create-workspace.md). Ensure that it has an Azure Data Lake Storage Gen2 storage account configured as the default storage. For the Data Lake Storage Gen2 file system that you work with, ensure that you're the *Storage Blob Data Contributor*.
- An [Azure Machine Learning workspace](../../machine-learning/quickstart-create-resources.md).

## Install libraries

First, we will install the ```azure-ai-ml``` package.

```python
%pip install azure-ai-ml
```

## Create a Datastore

Azure Machine Learning offers a feature known as a Datastore, which acts as a reference to your existing Azure storage account. In this example, we'll create a Datastore that links to our Azure Synapse ADLS Gen2 storage account. After initializing an ```MLClient``` object, you can provide the connection details for your ADLS Gen2 account and then execute the code to create or update the Datastore.
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureDataLakeGen2Datastore
from azure.identity import DefaultAzureCredential

# Authenticate and load the workspace details from your config.json file
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Provide the connection details to your Azure Synapse ADLS Gen2 storage account
store = AzureDataLakeGen2Datastore(
    name="",          # name to give the new Datastore
    description="",   # description of the Datastore
    account_name="",  # ADLS Gen2 storage account name
    filesystem=""     # container (file system) to reference
)

ml_client.create_or_update(store)
```

You can learn more about creating and managing Azure Machine Learning datastores using this [tutorial on Azure Machine Learning data stores](../../machine-learning/concept-data.md).
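
Alternatively, once the Datastore exists, you can reference files in it directly through an ```azureml://``` datastore URI instead of a mount. The following is a minimal sketch, assuming the ```azureml-fsspec``` package is installed on your compute; every bracketed identifier is a placeholder for your own values.

```python
import pandas as pd  # the azureml-fsspec package enables pandas to resolve azureml:// URIs

# Placeholder identifiers -- substitute your own subscription, resource group,
# workspace, Datastore name, and file path
uri = (
    "azureml://subscriptions/<subscription-id>/resourcegroups/<resource-group>"
    "/workspaces/<workspace-name>/datastores/<datastore-name>/paths/<folder>/<file>.csv"
)

df = pd.read_csv(uri)
print(df.head(5))
```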

## Mount your ADLS Gen2 Storage Account

Once you have set up your Datastore, you can access this data by creating a **mount** to your ADLS Gen2 account. In Azure Machine Learning, creating a mount to your ADLS Gen2 account establishes a direct link between your workspace and the storage account, enabling seamless access to the data stored within. Essentially, a mount acts as a pathway that allows Azure Machine Learning to interact with the files and folders in your ADLS Gen2 account as if they were part of the local filesystem within your workspace.

Once the storage account is mounted, you can read, write, and manipulate data stored in ADLS Gen2 using familiar filesystem operations directly within your Azure Machine Learning environment, simplifying data preprocessing, model training, and experimentation tasks.

To do this:

1. Start your compute engine.
2. Select **Data Actions** and then select **Mount**.

![Screenshot of Azure Machine Learning option to select data actions.](./media/tutorial-access-data-from-aml/data-actions.png)

3. From here, you should see and select your ADLS Gen2 storage account name. It may take a few moments for your mount to be created.
4. Once your mount is ready, you can select **Data actions** and then **Consume**. Under **Data**, you can then select the mount that you want to consume data from.

Now, you can use your preferred libraries to directly read data from your mounted Azure Data Lake Storage account.

## Read data from your storage account

```python
import os

import pandas as pd

# List the files in the mounted path
print(os.listdir("/home/azureuser/cloudfiles/data/datastore/{name of mount}"))

# Get the path of your file and load the data using your preferred libraries
df = pd.read_csv("/home/azureuser/cloudfiles/data/datastore/{name of mount}/{file name}")
print(df.head(5))
```
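
Because the mount behaves like a local filesystem, you can also write results back to ADLS Gen2 through the same path. The following is a minimal sketch, assuming your mount was created with write access and reusing the ```df``` DataFrame loaded above; the ```processed``` folder and file name are hypothetical examples.

```python
import os

# Hypothetical output folder under the same mount -- replace {name of mount} with your mount name
output_dir = "/home/azureuser/cloudfiles/data/datastore/{name of mount}/processed"
os.makedirs(output_dir, exist_ok=True)

# Persist the DataFrame back to ADLS Gen2 through the mounted path
df.to_csv(os.path.join(output_dir, "output.csv"), index=False)
```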

## Next steps

- [Create and manage GPUs in Azure Machine Learning](../../machine-learning/how-to-train-distributed-gpu.md)
- [Create Automated ML jobs in Azure Machine Learning](../../machine-learning/concept-automated-ml.md)

articles/synapse-analytics/machine-learning/concept-deep-learning.md

Lines changed: 7 additions & 2 deletions
@@ -5,21 +5,26 @@ author: midesa
ms.service: synapse-analytics
ms.topic: conceptual
ms.subservice: machine-learning
-ms.date: 04/19/2022
+ms.date: 02/27/2024
ms.author: midesa
---

# Deep learning (Preview)

Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options when training machine learning models using Apache Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.

+> [!WARNING]
+> - The GPU-accelerated preview is limited to the [Azure Synapse 3.1 (unsupported)](../spark/apache-spark-3-runtime.md) and [Apache Spark 3.2 (EOLA)](../spark/apache-spark-32-runtime.md) runtimes.
+> - Azure Synapse Runtime for Apache Spark 3.1 reached its end of life (EOL) on January 26, 2023. Official support ended on January 26, 2024, and support tickets, bug fixes, and security updates are no longer addressed beyond that date.
+> - Azure Synapse Runtime for Apache Spark 3.2 reached its end of life (EOL) on July 8, 2023. No further bug or feature fixes are made, but security fixes may be backported based on risk assessment; the runtime will be retired and disabled as of July 8, 2024.
+

## GPU-enabled Apache Spark pools

To simplify the process for creating and managing pools, Azure Synapse takes care of pre-installing low-level libraries and setting up all the complex networking requirements between compute nodes. This integration allows users to get started with GPU-accelerated pools within just a few minutes. To learn more, see the quickstart on how to [create a GPU-accelerated pool](../quickstart-create-apache-gpu-pool-portal.md).

> [!NOTE]
> - GPU-accelerated pools can be created in workspaces located in East US, Australia East, and North Europe.
-> - GPU-accelerated pools are only available with the Apache Spark 3.1 and 3.2 runtime.
+> - GPU-accelerated pools are only available with the Apache Spark 3.1 (unsupported) and 3.2 runtimes.
> - You might need to request a [limit increase](../spark/apache-spark-rapids-gpu.md#quotas-and-resource-constraints-in-azure-synapse-gpu-enabled-pools) in order to create GPU-enabled clusters.

## GPU ML Environment

articles/synapse-analytics/machine-learning/tutorial-horovod-pytorch.md

Lines changed: 14 additions & 8 deletions
@@ -4,7 +4,7 @@ description: Tutorial on how to run distributed training with the Horovod Estima
ms.service: synapse-analytics
ms.subservice: machine-learning
ms.topic: tutorial
-ms.date: 04/19/2022
+ms.date: 02/27/2024
author: midesa
ms.author: midesa
---
@@ -13,18 +13,24 @@ ms.author: midesa

[Horovod](https://github.com/horovod/horovod) is a distributed training framework for libraries like TensorFlow and PyTorch. With Horovod, users can scale up an existing training script to run on hundreds of GPUs in just a few lines of code.

-Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime.For Spark ML pipeline applications using PyTorch, users can use the horovod.spark estimator API. This notebook uses an Apache Spark dataframe to perform distributed training of a distributed neural network (DNN) model on MNIST dataset. This tutorial leverages PyTorch and the Horovod Estimator to run the training process.
+Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime. For Spark ML pipeline applications using PyTorch, users can use the horovod.spark estimator API. This notebook uses an Apache Spark dataframe to perform distributed training of a deep neural network (DNN) model on the MNIST dataset. This tutorial uses PyTorch and the Horovod Estimator to run the training process.

## Prerequisites

- [Azure Synapse Analytics workspace](../get-started-create-workspace.md) with an Azure Data Lake Storage Gen2 storage account configured as the default storage. You need to be the *Storage Blob Data Contributor* of the Data Lake Storage Gen2 file system that you work with.
- Create a GPU-enabled Apache Spark pool in your Azure Synapse Analytics workspace. For details, see [Create a GPU-enabled Apache Spark pool in Azure Synapse](../spark/apache-spark-gpu-concept.md). For this tutorial, we suggest using the GPU-Large cluster size with 3 nodes.

+> [!WARNING]
+> - The GPU-accelerated preview is limited to the [Azure Synapse 3.1 (unsupported)](../spark/apache-spark-3-runtime.md) and [Apache Spark 3.2 (EOLA)](../spark/apache-spark-32-runtime.md) runtimes.
+> - Azure Synapse Runtime for Apache Spark 3.1 reached its end of life (EOL) on January 26, 2023. Official support ended on January 26, 2024, and support tickets, bug fixes, and security updates are no longer addressed beyond that date.
+> - Azure Synapse Runtime for Apache Spark 3.2 reached its end of life (EOL) on July 8, 2023. No further bug or feature fixes are made, but security fixes may be backported based on risk assessment; the runtime will be retired and disabled as of July 8, 2024.
+

## Configure the Apache Spark session

-At the start of the session, we will need to configure a few Apache Spark settings. In most cases, we only needs to set the numExecutors and spark.rapids.memory.gpu.reserve. For very large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For Tensorflow models, users will need to set the ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to be true.
+At the start of the session, we need to configure a few Apache Spark settings. In most cases, we only need to set ```numExecutors``` and ```spark.rapids.memory.gpu.reserve```. For large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For TensorFlow models, users need to set ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to true.

-In the example below, you can see how the Spark configurations can be passed with the ```%%configure``` command. The detailed meaning of each parameter is explained in the [Apache Spark configuration documentation](https://spark.apache.org/docs/latest/configuration.html). The values provided below are the suggested, best practice values for Azure Synapse GPU-large pools.
+In the following example, you can see how the Spark configurations can be passed with the ```%%configure``` command. The detailed meaning of each parameter is explained in the [Apache Spark configuration documentation](https://spark.apache.org/docs/latest/configuration.html). The values provided are the suggested best-practice values for Azure Synapse GPU-large pools.

```spark
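
The ```%%configure``` cell itself is cut off at this point in the diff. For reference, a representative cell for a GPU-large pool might look like the following sketch; the specific memory and executor values are illustrative assumptions, not the tutorial's exact settings.

```spark
%%configure -f
{
    "numExecutors": 3,
    "conf": {
        "spark.rapids.memory.gpu.reserve": "10g",
        "spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH": "true",
        "spark.kryoserializer.buffer.max": "2000m"
    }
}
```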
@@ -61,7 +67,7 @@ For this tutorial, we will use the following configurations:

## Import dependencies

-In this tutorial, we will leverage PySpark to read and process the dataset. We will then use PyTorch and Horovod to build the distributed neural network (DNN) model and run the training process. To get started, we will need to import the following dependencies:
+In this tutorial, we use PySpark to read and process the dataset. Then, we use PyTorch and Horovod to build the distributed neural network (DNN) model and run the training process. To get started, we need to import the following dependencies:

```python
# base libs
@@ -94,7 +100,7 @@ from azure.synapse.ml.horovodutils import AdlsStore

## Connect to alternative storage account

-We will need the Azure Data Lake Storage (ADLS) account for storing intermediate and model data. If you are using an alternative storage account, be sure to set up the [linked service](../../data-factory/concepts-linked-services.md) to automatically authenticate and read from the account. In addition, you will need to modify the following properties below: ```remote_url```, ```account_name```, and ```linked_service_name```.
+We need the Azure Data Lake Storage (ADLS) account for storing intermediate and model data. If you are using an alternative storage account, be sure to set up the [linked service](../../data-factory/concepts-linked-services.md) to automatically authenticate and read from the account. In addition, you need to modify the following properties: ```remote_url```, ```account_name```, and ```linked_service_name```.

```python
num_proc = 3 # equal to numExecutors
@@ -164,7 +170,7 @@ train_df.count()

## Define DNN model

-Once we have finished processing our dataset, we can now define our PyTorch model. The same code could also be used to train a single-node PyTorch model.
+Once we are finished processing our dataset, we can now define our PyTorch model. The same code could also be used to train a single-node PyTorch model.

```python
# Define the PyTorch model without any Horovod-specific parameters
@@ -227,7 +233,7 @@ torch_model = torch_estimator.fit(train_df).setOutputCols(['label_prob'])

## Evaluate trained model

-Once the training process has finished, we can then evaluate the model on the test dataset.
+Once the training process completes, we can then evaluate the model on the test dataset.

```python
# Evaluate the model on the held-out test DataFrame
