---
title: Access ADLSg2 data from Azure Machine Learning
description: This article provides an overview of how to access data in your Azure Data Lake Storage Gen2 (ADLSg2) account directly from Azure Machine Learning.
author: midesa
ms.service: synapse-analytics
ms.topic: tutorial
ms.subservice: machine-learning
ms.date: 02/27/2024
ms.author: midesa
---

# Tutorial: Accessing Azure Synapse ADLS Gen2 Data in Azure Machine Learning

In this tutorial, we walk through the process of accessing data stored in Azure Synapse Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Machine Learning. This capability is especially valuable when you want to streamline your machine learning workflow using tools such as Automated ML, integrated model and experiment tracking, or specialized hardware like GPUs available in Azure Machine Learning.

To access ADLS Gen2 data in Azure Machine Learning, we create an Azure Machine Learning Datastore that points to the Azure Synapse ADLS Gen2 storage account.

## Prerequisites

- An [Azure Synapse Analytics workspace](../get-started-create-workspace.md). Ensure that it has an Azure Data Lake Storage Gen2 storage account configured as the default storage. For the Data Lake Storage Gen2 file system that you work with, ensure that you have the *Storage Blob Data Contributor* role.
- An [Azure Machine Learning workspace](../../machine-learning/quickstart-create-resources.md).

## Install libraries

First, we install the ```azure-ai-ml``` package.
25
+
26
+
```python
27
+
%pip install azure-ai-ml
28
+
29
+
```

## Create a Datastore

Azure Machine Learning offers a feature known as a Datastore, which acts as a reference to your existing Azure storage account. In this example, we create a Datastore that references our Azure Synapse ADLS Gen2 storage account. After initializing an ```MLClient``` object, provide the connection details for your ADLS Gen2 account, then run the code to create or update the Datastore.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureDataLakeGen2Datastore
from azure.identity import DefaultAzureCredential

# Authenticate and load the workspace details from the local config.json
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Provide the connection details to your Azure Synapse ADLSg2 storage account
store = AzureDataLakeGen2Datastore(
    name="",  # name to give the new Datastore
    description="",
    account_name="",  # ADLS Gen2 storage account name
    filesystem=""  # container (file system) that holds your data
)

ml_client.create_or_update(store)
```

You can learn more about creating and managing Azure Machine Learning Datastores in this [article on Azure Machine Learning data stores](../../machine-learning/concept-data.md).
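
Once the Datastore exists, you can also reference files through it directly by URI, without mounting. The following is a minimal sketch, assuming the ```azureml-fsspec``` package is installed on your compute; every angle-bracket value is a placeholder for your own subscription, resource group, workspace, Datastore name, and file path.

```python
import pandas as pd

# All angle-bracket values are placeholders; substitute your own details
uri = (
    "azureml://subscriptions/<subscription_id>/resourcegroups/<resource_group>"
    "/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<file>.csv"
)

# pandas resolves azureml:// URIs through the fsspec implementation in azureml-fsspec
df = pd.read_csv(uri)
print(df.head())
```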

## Mount your ADLS Gen2 Storage Account

Once you have set up your Datastore, you can access the data by creating a **mount** to your ADLSg2 account. In Azure Machine Learning, creating a mount to your ADLS Gen2 account establishes a direct link between your workspace and the storage account, enabling seamless access to the data stored within. Essentially, a mount acts as a pathway that allows Azure Machine Learning to interact with the files and folders in your ADLS Gen2 account as if they were part of the local filesystem within your workspace.

Once the storage account is mounted, you can read, write, and manipulate data stored in ADLS Gen2 using familiar filesystem operations directly within your Azure Machine Learning environment, simplifying data preprocessing, model training, and experimentation tasks.

To do this:

1. Start your compute instance.
2. Select **Data actions** and then select **Mount**.

   ![Screenshot of the Mount option under Data actions in Azure Machine Learning studio.](./media/tutorial-access-data-from-aml/azure-ml-mount-adls.png)

3. From here, select your ADLSg2 storage account name. It may take a few moments for your mount to be created.
4. Once your mount is ready, select **Data actions** and then **Consume**. Under **Data**, select the mount that you want to consume data from.

Now, you can use your preferred libraries to directly read data from your mounted Azure Data Lake Storage account.

## Read data from your storage account

```python
import os
import pandas as pd

# List the files in the mounted path
print(os.listdir("/home/azureuser/cloudfiles/data/datastore/{name of mount}"))

# Get the path of your file and load the data using your preferred libraries
df = pd.read_csv("/home/azureuser/cloudfiles/data/datastore/{name of mount}/{file name}")
print(df.head(5))
```
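
Because the mount behaves like a local filesystem, writing results back to ADLS Gen2 works the same way. A minimal sketch, reusing the hypothetical mount path from above and assuming your identity has write access to the underlying container:

```python
import os
import pandas as pd

# Hypothetical mount path: replace {name of mount} and {file name} with your own values
mount_root = "/home/azureuser/cloudfiles/data/datastore/{name of mount}"
df = pd.read_csv(os.path.join(mount_root, "{file name}"))

# Write a cleaned copy back to the storage account through the mount
out_dir = os.path.join(mount_root, "processed")
os.makedirs(out_dir, exist_ok=True)
df.dropna().to_csv(os.path.join(out_dir, "cleaned.csv"), index=False)
```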

## Next steps

- [Create and manage GPUs in Azure Machine Learning](../../machine-learning/how-to-train-distributed-gpu.md)
- [Create Automated ML jobs in Azure Machine Learning](../../machine-learning/concept-automated-ml.md)

**articles/synapse-analytics/machine-learning/concept-deep-learning.md** (+7 -2 lines)

````diff
@@ -5,21 +5,26 @@ author: midesa
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: machine-learning
-ms.date: 04/19/2022
+ms.date: 02/27/2024
 ms.author: midesa
 ---
 
 # Deep learning (Preview)
 
 Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options when training machine learning models using Azure Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.
 
+> [!WARNING]
+> - The GPU accelerated preview is limited to the [Azure Synapse 3.1 (unsupported)](../spark/apache-spark-3-runtime.md) and [Apache Spark 3.2 (EOLA)](../spark/apache-spark-32-runtime.md) runtimes.
+> - Azure Synapse Runtime for Apache Spark 3.1 has reached its end of life (EOL) as of January 26, 2023, with official support discontinued effective January 26, 2024, and no further addressing of support tickets, bug fixes, or security updates beyond this date.
+> - Azure Synapse Runtime for Apache Spark 3.2 has reached its end of life (EOL) as of July 8, 2023, with no further bug or feature fixes, but security fixes may be backported based on risk assessment, and it will be retired and disabled as of July 8, 2024.
+
 ## GPU-enabled Apache Spark pools
 
 To simplify the process for creating and managing pools, Azure Synapse takes care of pre-installing low-level libraries and setting up all the complex networking requirements between compute nodes. This integration allows users to get started with GPU-accelerated pools within just a few minutes. To learn more about how to create a GPU-accelerated pool, you can visit the quickstart on how to [create a GPU-accelerated pool](../quickstart-create-apache-gpu-pool-portal.md).
 
 > [!NOTE]
 > - GPU-accelerated pools can be created in workspaces located in East US, Australia East, and North Europe.
-> - GPU-accelerated pools are only available with the Apache Spark 3.1 and 3.2 runtime.
+> - GPU-accelerated pools are only available with the Apache Spark 3.1 (unsupported) and 3.2 runtime.
 > - You might need to request a [limit increase](../spark/apache-spark-rapids-gpu.md#quotas-and-resource-constraints-in-azure-synapse-gpu-enabled-pools) in order to create GPU-enabled clusters.
````

**articles/synapse-analytics/machine-learning/tutorial-horovod-pytorch.md** (+14 -8 lines)

````diff
@@ -4,7 +4,7 @@ description: Tutorial on how to run distributed training with the Horovod Estima
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: tutorial
-ms.date: 04/19/2022
+ms.date: 02/27/2024
 author: midesa
 ms.author: midesa
 ---
@@ -13,18 +13,24 @@ ms.author: midesa
 
 [Horovod](https://github.com/horovod/horovod) is a distributed training framework for libraries like TensorFlow and PyTorch. With Horovod, users can scale up an existing training script to run on hundreds of GPUs in just a few lines of code.
 
-Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime. For Spark ML pipeline applications using PyTorch, users can use the horovod.spark estimator API. This notebook uses an Apache Spark dataframe to perform distributed training of a distributed neural network (DNN) model on the MNIST dataset. This tutorial leverages PyTorch and the Horovod Estimator to run the training process.
+Within Azure Synapse Analytics, users can quickly get started with Horovod using the default Apache Spark 3 runtime. For Spark ML pipeline applications using PyTorch, users can use the horovod.spark estimator API. This notebook uses an Apache Spark dataframe to perform distributed training of a distributed neural network (DNN) model on the MNIST dataset. This tutorial uses PyTorch and the Horovod Estimator to run the training process.
 
 ## Prerequisites
 
 - [Azure Synapse Analytics workspace](../get-started-create-workspace.md) with an Azure Data Lake Storage Gen2 storage account configured as the default storage. You need to be the *Storage Blob Data Contributor* of the Data Lake Storage Gen2 file system that you work with.
 - Create a GPU-enabled Apache Spark pool in your Azure Synapse Analytics workspace. For details, see [Create a GPU-enabled Apache Spark pool in Azure Synapse](../spark/apache-spark-gpu-concept.md). For this tutorial, we suggest using the GPU-Large cluster size with 3 nodes.
 
+> [!WARNING]
+> - The GPU accelerated preview is limited to the [Azure Synapse 3.1 (unsupported)](../spark/apache-spark-3-runtime.md) and [Apache Spark 3.2 (EOLA)](../spark/apache-spark-32-runtime.md) runtimes.
+> - Azure Synapse Runtime for Apache Spark 3.1 has reached its end of life (EOL) as of January 26, 2023, with official support discontinued effective January 26, 2024, and no further addressing of support tickets, bug fixes, or security updates beyond this date.
+> - Azure Synapse Runtime for Apache Spark 3.2 has reached its end of life (EOL) as of July 8, 2023, with no further bug or feature fixes, but security fixes may be backported based on risk assessment, and it will be retired and disabled as of July 8, 2024.
+
 ## Configure the Apache Spark session
 
-At the start of the session, we will need to configure a few Apache Spark settings. In most cases, we only needs to set the numExecutors and spark.rapids.memory.gpu.reserve. For very large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For Tensorflow models, users will need to set the ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to be true.
+At the start of the session, we need to configure a few Apache Spark settings. In most cases, we only need to set the numExecutors and spark.rapids.memory.gpu.reserve. For large models, users may also need to configure the ```spark.kryoserializer.buffer.max``` setting. For Tensorflow models, users need to set the ```spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH``` to be true.
 
-In the example below, you can see how the Spark configurations can be passed with the ```%%configure``` command. The detailed meaning of each parameter is explained in the [Apache Spark configuration documentation](https://spark.apache.org/docs/latest/configuration.html). The values provided below are the suggested, best practice values for Azure Synapse GPU-large pools.
+In the example, you can see how the Spark configurations can be passed with the ```%%configure``` command. The detailed meaning of each parameter is explained in the [Apache Spark configuration documentation](https://spark.apache.org/docs/latest/configuration.html). The values provided are the suggested, best practice values for Azure Synapse GPU-large pools.
 
 ```spark
@@ -61,7 +67,7 @@ For this tutorial, we will use the following configurations:
 
 ## Import dependencies
 
-In this tutorial, we will leverage PySpark to read and process the dataset. We will then use PyTorch and Horovod to build the distributed neural network (DNN) model and run the training process. To get started, we will need to import the following dependencies:
+In this tutorial, we use PySpark to read and process the dataset. Then, we use PyTorch and Horovod to build the distributed neural network (DNN) model and run the training process. To get started, we need to import the following dependencies:
 
 ```python
 # base libs
@@ -94,7 +100,7 @@ from azure.synapse.ml.horovodutils import AdlsStore
 
 ## Connect to alternative storage account
 
-We will need the Azure Data Lake Storage (ADLS) account for storing intermediate and model data. If you are using an alternative storage account, be sure to set up the [linked service](../../data-factory/concepts-linked-services.md) to automatically authenticate and read from the account. In addition, you will need to modify the following properties below: ```remote_url```, ```account_name```, and ```linked_service_name```.
+We need the Azure Data Lake Storage (ADLS) account for storing intermediate and model data. If you are using an alternative storage account, be sure to set up the [linked service](../../data-factory/concepts-linked-services.md) to automatically authenticate and read from the account. In addition, you need to modify the following properties: ```remote_url```, ```account_name```, and ```linked_service_name```.
 
 ```python
 num_proc = 3  # equal to numExecutors
@@ -164,7 +170,7 @@ train_df.count()
 
 ## Define DNN model
 
-Once we have finished processing our dataset, we can now define our PyTorch model. The same code could also be used to train a single-node PyTorch model.
+Once we are finished processing our dataset, we can now define our PyTorch model. The same code could also be used to train a single-node PyTorch model.
 
 ```python
 # Define the PyTorch model without any Horovod-specific parameters
````
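
The contents of the ```%%configure``` cell are elided in the hunk above. For orientation, a session-configuration cell of the shape that paragraph describes looks like the sketch below; ```numExecutors``` and ```conf``` are standard Livy session options, but the values shown are illustrative placeholders, not the tutorial's recommended settings for GPU-large pools.

```spark
%%configure -f
{
    "numExecutors": 3,
    "conf": {
        "spark.rapids.memory.gpu.reserve": "10g",
        "spark.kryoserializer.buffer.max": "2000m"
    }
}
```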