
Commit 5f5b425

Merge pull request #520 from fbsolo-ms1/freshness-update-for-YP
Freshness update for interactive-data-wrangling-with-apache-spark-azure-ml.md . . .
2 parents abf10b8 + f3f44c2 commit 5f5b425

1 file changed: +41 -38 lines changed

articles/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml.md

Lines changed: 41 additions & 38 deletions
@@ -8,81 +8,84 @@ ms.reviewer: yogipandey
 ms.service: azure-machine-learning
 ms.subservice: mldata
 ms.topic: how-to
-ms.date: 10/05/2023
-ms.custom: template-how-to
+ms.date: 09/26/2024
+ms.custom: template-how-to
 ---

 # Interactive Data Wrangling with Apache Spark in Azure Machine Learning

-Data wrangling becomes one of the most important steps in machine learning projects. The Azure Machine Learning integration, with Azure Synapse Analytics, provides access to an Apache Spark pool - backed by Azure Synapse - for interactive data wrangling using Azure Machine Learning Notebooks.
+Data wrangling becomes one of the most important aspects of machine learning projects. The integration of Azure Machine Learning with Azure Synapse Analytics provides access to an Apache Spark pool - backed by Azure Synapse - for interactive data wrangling that uses Azure Machine Learning Notebooks.

-In this article, you'll learn how to perform data wrangling using
+In this article, you learn how to perform data wrangling using

 - Serverless Spark compute
 - Attached Synapse Spark pool

 ## Prerequisites
 - An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
-- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
-- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](/azure/storage/blobs/create-data-lake-storage-account).
-- (Optional): An Azure Key Vault. See [Create an Azure Key Vault](/azure/key-vault/general/quick-create-portal).
-- (Optional): A Service Principal. See [Create a Service Principal](/azure/active-directory/develop/howto-create-service-principal-portal).
-- [(Optional): An attached Synapse Spark pool in the Azure Machine Learning workspace](./how-to-manage-synapse-spark-pool.md).
+- An Azure Machine Learning workspace. Visit [Create workspace resources](./quickstart-create-resources.md) for more information.
+- An Azure Data Lake Storage (ADLS) Gen 2 storage account. Visit [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](/azure/storage/blobs/create-data-lake-storage-account) for more information.
+- (Optional): An Azure Key Vault. Visit [Create an Azure Key Vault](/azure/key-vault/general/quick-create-portal) for more information.
+- (Optional): A Service Principal. Visit [Create a Service Principal](/azure/active-directory/develop/howto-create-service-principal-portal) for more information.
+- (Optional): [An attached Synapse Spark pool in the Azure Machine Learning workspace](./how-to-manage-synapse-spark-pool.md).

 Before you start your data wrangling tasks, learn about the process of storing secrets

 - Azure Blob storage account access key
 - Shared Access Signature (SAS) token
 - Azure Data Lake Storage (ADLS) Gen 2 service principal information

-in the Azure Key Vault. You also need to know how to handle role assignments in the Azure storage accounts. The following sections review these concepts. Then, we'll explore the details of interactive data wrangling using the Spark pools in Azure Machine Learning Notebooks.
+in the Azure Key Vault. You also need to know how to handle role assignments in the Azure storage accounts. The following sections in this document describe these concepts. Then, we explore the details of interactive data wrangling, using the Spark pools in Azure Machine Learning Notebooks.

 > [!TIP]
-> To learn about Azure storage account role assignment configuration, or if you access data in your storage accounts using user identity passthrough, see [Add role assignments in Azure storage accounts](./apache-spark-environment-configuration.md#add-role-assignments-in-azure-storage-accounts).
+> For information about Azure storage account role assignment configuration, or if you access data in your storage accounts using user identity passthrough, visit [Add role assignments in Azure storage accounts](./apache-spark-environment-configuration.md#add-role-assignments-in-azure-storage-accounts).

 ## Interactive Data Wrangling with Apache Spark

-Azure Machine Learning offers serverless Spark compute, and [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md), for interactive data wrangling with Apache Spark in Azure Machine Learning Notebooks. The serverless Spark compute doesn't require creation of resources in the Azure Synapse workspace. Instead, a fully managed serverless Spark compute becomes directly available in the Azure Machine Learning Notebooks. Using a serverless Spark compute is the easiest approach to access a Spark cluster in Azure Machine Learning.
+For interactive data wrangling with Apache Spark in Azure Machine Learning Notebooks, Azure Machine Learning offers serverless Spark compute and [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md). The serverless Spark compute doesn't require creation of resources in the Azure Synapse workspace. Instead, a fully managed serverless Spark compute becomes directly available in the Azure Machine Learning Notebooks. Use of a serverless Spark compute is the easiest way to access a Spark cluster in Azure Machine Learning.

 ### Serverless Spark compute in Azure Machine Learning Notebooks

 A serverless Spark compute is available in Azure Machine Learning Notebooks by default. To access it in a notebook, select **Serverless Spark Compute** under **Azure Machine Learning Serverless Spark** from the **Compute** selection menu.

-The Notebooks UI also provides options for Spark session configuration, for the serverless Spark compute. To configure a Spark session:
+The Notebooks UI also provides options for Spark session configuration for the serverless Spark compute. To configure a Spark session:

 1. Select **Configure session** at the top of the screen.
-2. Select **Apache Spark version** from the dropdown menu.
+1. Select **Apache Spark version** from the dropdown menu.
    > [!IMPORTANT]
    > Azure Synapse Runtime for Apache Spark: Announcements
    > * Azure Synapse Runtime for Apache Spark 3.2:
    >   * EOLA Announcement Date: July 8, 2023
    >   * End of Support Date: July 8, 2024. After this date, the runtime will be disabled.
-   > * For continued support and optimal performance, we advise that you migrate to Apache Spark 3.3.
-3. Select **Instance type** from the dropdown menu. The following instance types are currently supported:
+   > * Apache Spark 3.3:
+   >   * EOLA Announcement Date: July 12, 2024
+   >   * End of Support Date: March 31, 2025. After this date, the runtime will be disabled.
+   > * For continued support and optimal performance, we advise migration to **Apache Spark 3.4**.
+1. Select **Instance type** from the dropdown menu. These types are currently supported:
    - `Standard_E4s_v3`
    - `Standard_E8s_v3`
    - `Standard_E16s_v3`
    - `Standard_E32s_v3`
    - `Standard_E64s_v3`
-4. Input a Spark **Session timeout** value, in minutes.
-5. Select whether to **Dynamically allocate executors**
-6. Select the number of **Executors** for the Spark session.
-7. Select **Executor size** from the dropdown menu.
-8. Select **Driver size** from the dropdown menu.
-9. To use a Conda file to configure a Spark session, check the **Upload conda file** checkbox. Then, select **Browse**, and choose the Conda file with the Spark session configuration you want.
-10. Add **Configuration settings** properties, input values in the **Property** and **Value** textboxes, and select **Add**.
-11. Select **Apply**.
-12. Select **Stop session** in the **Configure new session?** pop-up.
+1. Input a Spark **Session timeout** value, in minutes.
+1. Select whether or not you want to **Dynamically allocate executors**.
+1. Select the number of **Executors** for the Spark session.
+1. Select **Executor size** from the dropdown menu.
+1. Select **Driver size** from the dropdown menu.
+1. To use a Conda file to configure a Spark session, check the **Upload conda file** checkbox. Then, select **Browse**, and choose the Conda file with the Spark session configuration you want.
+1. Add **Configuration settings** properties, input values in the **Property** and **Value** textboxes, and select **Add**.
+1. Select **Apply**.
+1. In the **Configure new session?** pop-up, select **Stop session**.

 The session configuration changes persist and become available to another notebook session that is started using the serverless Spark compute.

 > [!TIP]
 >
-> If you use session-level Conda packages, you can [improve](./apache-spark-azure-ml-concepts.md#improving-session-cold-start-time-while-using-session-level-conda-packages) the Spark session *cold start* time if you set the configuration variable `spark.hadoop.aml.enable_cache` to true. A session cold start with session level Conda packages typically takes 10 to 15 minutes when the session starts for the first time. However, subsequent session cold starts with the configuration variable set to true typically take three to five minutes.
+> If you use session-level Conda packages, you can [improve](./apache-spark-azure-ml-concepts.md#improving-session-cold-start-time-while-using-session-level-conda-packages) the Spark session *cold start* time if you set the configuration variable `spark.hadoop.aml.enable_cache` to **true**. A session cold start with session-level Conda packages typically takes 10 to 15 minutes when the session starts for the first time. However, subsequent session cold starts with the configuration variable set to **true** typically take three to five minutes.
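A minimal sketch of reading this flag back from a notebook cell, assuming an active Spark session (the flag itself takes effect when set through **Configuration settings** before the session starts):

```python
from pyspark.sql import SparkSession

# Reuse the active Spark session of the notebook.
spark = SparkSession.builder.getOrCreate()

# Read the cache flag back; "false" is returned if it was never set.
print(spark.conf.get("spark.hadoop.aml.enable_cache", "false"))
```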

 ### Import and wrangle data from Azure Data Lake Storage (ADLS) Gen 2

-You can access and wrangle data stored in Azure Data Lake Storage (ADLS) Gen 2 storage accounts with `abfss://` data URIs following one of the two data access mechanisms:
+You can access and wrangle data stored in Azure Data Lake Storage (ADLS) Gen 2 storage accounts with `abfss://` data URIs. To do this, you must follow one of the two data access mechanisms:

 - User identity passthrough
 - Service principal-based data access
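The user identity passthrough steps themselves are elided between these hunks. As an illustrative sketch (not part of this commit), a passthrough read needs no credential configuration in the Spark session; the path below is a placeholder:

```python
import pyspark.pandas as pd

# Placeholder container, account, and path -- replace with your own values.
# With user identity passthrough, no credentials are set in the Spark session;
# access is authorized through your own role assignments on the storage account.
df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
print(df.head(5))
```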
@@ -127,9 +130,9 @@ To start interactive data wrangling with the user identity passthrough:
 To wrangle data by access through a service principal:

 1. Verify that the service principal has **Contributor** and **Storage Blob Data Contributor** [role assignments](./apache-spark-environment-configuration.md#add-role-assignments-in-azure-storage-accounts) in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
-2. [Create Azure Key Vault secrets](./apache-spark-environment-configuration.md#store-azure-storage-account-credentials-as-secrets-in-azure-key-vault) for the service principal tenant ID, client ID and client secret values.
-3. Select **Serverless Spark compute** under **Azure Machine Learning Serverless Spark** from the **Compute** selection menu, or select an attached Synapse Spark pool under **Synapse Spark pools** from the **Compute** selection menu.
-4. To set the service principal tenant ID, client ID and client secret in the configuration, and execute the following code sample.
+1. [Create Azure Key Vault secrets](./apache-spark-environment-configuration.md#store-azure-storage-account-credentials-as-secrets-in-azure-key-vault) for the service principal tenant ID, client ID and client secret values.
+1. In the **Compute** selection menu, select **Serverless Spark compute** under **Azure Machine Learning Serverless Spark**. You can also select an attached Synapse Spark pool under **Synapse Spark pools** from the **Compute** selection menu.
+1. Set the service principal tenant ID, client ID and client secret values in the configuration, and execute the following code sample.
    - The `get_secret()` call in the code depends on the name of the Azure Key Vault, and the names of the Azure Key Vault secrets created for the service principal tenant ID, client ID and client secret. Set these corresponding property name/values in the configuration:
      - Client ID property: `fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net`
      - Client secret property: `fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net`
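The code sample itself is elided by this diff. A sketch of how these properties could be set, assuming the article's `get_secret()` helper is implemented with the `azure-keyvault-secrets` SDK, and assuming the standard Hadoop ABFS OAuth properties for the auth type, provider, and token endpoint (only the client ID and client secret property names above come from this diff):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from pyspark.sql import SparkSession


def get_secret(vault_name: str, secret_name: str) -> str:
    # One possible implementation of the article's get_secret() helper,
    # using the azure-keyvault-secrets SDK (an assumption, not from this diff).
    client = SecretClient(
        vault_url=f"https://{vault_name}.vault.azure.net",
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(secret_name).value


spark = SparkSession.builder.getOrCreate()

# Placeholder Key Vault and secret names -- replace with your own values.
tenant_id = get_secret("<KEY_VAULT_NAME>", "<TENANT_ID_SECRET_NAME>")
client_id = get_secret("<KEY_VAULT_NAME>", "<CLIENT_ID_SECRET_NAME>")
client_secret = get_secret("<KEY_VAULT_NAME>", "<CLIENT_SECRET_SECRET_NAME>")

storage = "<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net"

# Service principal (OAuth client credentials) configuration for ABFS.
spark.conf.set(f"fs.azure.account.auth.type.{storage}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```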
@@ -169,18 +172,18 @@ To wrangle data by access through a service principal:
     )
     ```

-5. Import and wrangle data using data URI in format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` as shown in the code sample, using the Titanic data.
+1. Using the Titanic data, import and wrangle the data with the data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format, as shown in the code sample.

 ### Import and wrangle data from Azure Blob storage

 You can access Azure Blob storage data with either the storage account access key or a shared access signature (SAS) token. You should [store these credentials in the Azure Key Vault as a secret](./apache-spark-environment-configuration.md#store-azure-storage-account-credentials-as-secrets-in-azure-key-vault), and set them as properties in the session configuration.

 To start interactive data wrangling:
 1. At the Azure Machine Learning studio left panel, select **Notebooks**.
-1. Select **Serverless Spark compute** under **Azure Machine Learning Serverless Spark** from the **Compute** selection menu, or select an attached Synapse Spark pool under **Synapse Spark pools** from the **Compute** selection menu.
+1. In the **Compute** selection menu, select **Serverless Spark compute** under **Azure Machine Learning Serverless Spark**. You can also select an attached Synapse Spark pool under **Synapse Spark pools** from the **Compute** selection menu.
 1. To configure the storage account access key or a shared access signature (SAS) token for data access in Azure Machine Learning Notebooks:

-   - For the access key, set property `fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net` as shown in this code snippet:
+   - For the access key, set the `fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net` property, as shown in this code snippet:

     ```python
     from pyspark.sql import SparkSession
@@ -192,7 +195,7 @@ To start interactive data wrangling:
         "fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net", access_key
     )
     ```
-   - For the SAS token, set property `fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net` as shown in this code snippet:
+   - For the SAS token, set the `fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net` property, as shown in this code snippet:

     ```python
     from pyspark.sql import SparkSession
@@ -206,7 +209,7 @@ To start interactive data wrangling:
     )
     ```
    > [!NOTE]
-   > The `get_secret()` calls in the above code snippets require the name of the Azure Key Vault, and the names of the secrets created for the Azure Blob storage account access key or SAS token
+   > The `get_secret()` calls in the earlier code snippets require the name of the Azure Key Vault, and the names of the secrets created for the Azure Blob storage account access key or SAS token.

 2. Execute the data wrangling code in the same notebook. Format the data URI as `wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<PATH_TO_DATA>`, similar to what this code snippet shows:

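The wrangling snippet itself is elided by the diff; a minimal sketch under the same `wasbs://` URI format, with placeholder container, account, and path values:

```python
import pyspark.pandas as pd

# Placeholder container, account, and path -- replace with your own values.
df = pd.read_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
df.dropna(inplace=True)  # Drop rows that contain missing values
df.to_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/wrangled"
)
```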
@@ -239,7 +242,7 @@ To start interactive data wrangling:
 To access data from [Azure Machine Learning Datastore](how-to-datastore.md), define a path to data on the datastore with [URI format](how-to-create-data-assets.md?tabs=cli#create-data-assets) `azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA>`. To wrangle data from an Azure Machine Learning Datastore in a Notebooks session interactively:

 1. Select **Serverless Spark compute** under **Azure Machine Learning Serverless Spark** from the **Compute** selection menu, or select an attached Synapse Spark pool under **Synapse Spark pools** from the **Compute** selection menu.
-2. This code sample shows how to read and wrangle Titanic data from an Azure Machine Learning Datastore, using `azureml://` datastore URI, `pyspark.pandas` and `pyspark.ml.feature.Imputer`.
+1. This code sample shows how to read and wrangle Titanic data from an Azure Machine Learning Datastore, using `azureml://` datastore URI, `pyspark.pandas`, and `pyspark.ml.feature.Imputer`.

    ```python
    import pyspark.pandas as pd
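The sample is cut off by the hunk boundary; a sketch of how the pieces named in the step could fit together, with the `workspaceblobstore` datastore name and the file path as assumptions:

```python
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

# Assumed datastore name and path -- replace with your own values.
df = pd.read_csv("azureml://datastores/workspaceblobstore/paths/data/titanic.csv")

# Imputer operates on Spark DataFrames: convert, replace missing Age values
# with the column mean, then convert back to a pandas-on-Spark frame.
imputer = Imputer(inputCols=["Age"], outputCols=["Age"], strategy="mean")
sdf = df.to_spark()
df = imputer.fit(sdf).transform(sdf).pandas_api()

df.dropna(inplace=True)  # Drop rows that still contain missing values
df.to_csv("azureml://datastores/workspaceblobstore/paths/data/wrangled")
```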
@@ -271,7 +274,7 @@ The Azure Machine Learning datastores can access data using Azure storage accoun
 - SAS token
 - service principal

-or provide credential-less data access. Depending on the datastore type and the underlying Azure storage account type, select an appropriate authentication mechanism to ensure data access. This table summarizes the authentication mechanisms to access data in the Azure Machine Learning datastores:
+or they use credential-less data access. Depending on the datastore type and the underlying Azure storage account type, select an appropriate authentication mechanism to ensure data access. This table summarizes the authentication mechanisms to access data in the Azure Machine Learning datastores:

 |Storage account type|Credential-less data access|Data access mechanism|Role assignments|
 | ------------------------ | ------------------------ | ------------------------ | ------------------------ |
@@ -288,7 +291,7 @@ The default file share is mounted to both serverless Spark compute and attached

 :::image type="content" source="media/interactive-data-wrangling-with-apache-spark-azure-ml/default-file-share.png" alt-text="Screenshot showing use of a file share.":::

-In Azure Machine Learning studio, files in the default file share are shown in the directory tree under the **Files** tab. Notebook code can directly access files stored in this file share with `file://` protocol, along with the absolute path of the file, without more configurations. This code snippet shows how to access a file stored on the default file share:
+In Azure Machine Learning studio, files in the default file share are shown in the directory tree under the **Files** tab. Notebook code can directly access files stored in this file share with the `file://` protocol, along with the absolute path of the file, without more configurations. This code snippet shows how to access a file stored on the default file share:

 ```python
 import os

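The snippet is truncated here by the page; a sketch of how such `file://` access typically continues, with the user folder and file name as placeholders:

```python
import os

import pyspark.pandas as pd

# Placeholder path under the workspace default file share -- replace with your own.
abspath = os.path.abspath(".")
file_uri = "file://" + abspath + "/Users/<USER>/data/titanic.csv"

df = pd.read_csv(file_uri)
print(df.head(5))
```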