Commit c171e22

Merge pull request #229792 from fbsolo-ms1/tutorial-for-SK
Update: Interactive Data Wrangling w/Apache Spark
2 parents e832a23 + 29daa95 commit c171e22

18 files changed (+148, -215 lines changed)

articles/machine-learning/.openpublishing.redirection.machine-learning.json

Lines changed: 8 additions & 3 deletions
@@ -1,9 +1,14 @@
1 1
{
2 2
"redirections": [
3 3
{
4-
"source_path_from_root": "/articles/machine-learning/how-to-train-with-custom-image.md",
5-
"redirect_url": "/azure/machine-learning/v1/how-to-train-with-custom-image",
6-
"redirect_document_id": true
4+
"source_path_from_root": "/articles/machine-learning/quickstart-spark-data-wrangling.md",
5+
"redirect_url": "/azure/machine-learning/apache-spark-environment-configuration",
6+
"redirect_document_id": true
7+
},
8+
{
9+
"source_path_from_root": "/articles/machine-learning/how-to-train-with-custom-image.md",
10+
"redirect_url": "/azure/machine-learning/v1/how-to-train-with-custom-image",
11+
"redirect_document_id": true
7 12
},
8 13
{
9 14
"source_path_from_root": "/articles/machine-learning/how-to-monitor-tensorboard.md",
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
---
2+
title: Apache Spark - Environment Configuration
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how to configure your Apache Spark environment for interactive data wrangling
5+
author: ynpandey
6+
ms.author: franksolomon
7+
ms.reviewer: franksolomon
8+
ms.service: machine-learning
9+
ms.subservice: mldata
10+
ms.topic: how-to
11+
ms.date: 03/06/2023
12+
#Customer intent: As a Full Stack ML Pro, I want to perform interactive data wrangling in Azure Machine Learning with Apache Spark.
13+
---
14+
15+
# Quickstart: Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
16+
17+
[!INCLUDE [preview disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]
18+
19+
Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to the Apache Spark framework, which allows for interactive data wrangling in Azure Machine Learning Notebooks.
20+
21+
In this quickstart guide, you learn how to perform interactive data wrangling using Azure Machine Learning Managed (Automatic) Synapse Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.
22+
23+
## Prerequisites
24+
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
25+
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
26+
- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
27+
- To enable this feature:
28+
1. Navigate to the Azure Machine Learning studio UI
29+
2. In the icon section at the top right of the screen, select **Manage preview features** (megaphone icon)
30+
3. In the **Managed preview feature** panel, toggle the **Run notebooks and jobs on managed Spark** feature to **on**
31+
:::image type="content" source="./media/apache-spark-environment-configuration/how-to-enable-managed-spark-preview.png" lightbox="media/apache-spark-environment-configuration/how-to-enable-managed-spark-preview.png" alt-text="Screenshot showing the option to enable the Managed Spark preview.":::
32+
33+
## Store Azure storage account credentials as secrets in Azure Key Vault
34+
35+
To store Azure storage account credentials as secrets in the Azure Key Vault using the Azure portal user interface:
36+
37+
1. Navigate to your Azure Key Vault in the Azure portal.
38+
1. Select **Secrets** from the left panel.
39+
1. Select **+ Generate/Import**.
40+
41+
:::image type="content" source="media/apache-spark-environment-configuration/azure-key-vault-secrets-generate-import.png" alt-text="Screenshot showing the Azure Key Vault Secrets Generate Or Import tab.":::
42+
43+
1. At the **Create a secret** screen, enter a **Name** for the secret you want to create.
44+
1. Navigate to the Azure Blob Storage account in the Azure portal, as seen in this image:
45+
46+
:::image type="content" source="media/apache-spark-environment-configuration/storage-account-access-keys.png" alt-text="Screenshot showing the Azure access key and connection string values screen.":::
47+
1. Select **Access keys** from the left panel of the Azure Blob Storage account page.
48+
1. Select **Show** next to **Key 1**, and then **Copy to clipboard** to get the storage account access key.
49+
> [!Note]
50+
> Select the appropriate options to copy the following credentials:
51+
> - Azure Blob storage container shared access signature (SAS) tokens
52+
> - Azure Data Lake Storage (ADLS) Gen 2 storage account service principal credentials:
53+
>     - tenant ID
54+
>     - client ID
55+
>     - secret
56+
>
57+
> Copy them from their respective user interfaces when you create Azure Key Vault secrets for them.
58+
1. Navigate back to the **Create a secret** screen.
59+
1. In the **Secret value** textbox, enter the access key credential for the Azure storage account, which was copied to the clipboard in the earlier step.
60+
1. Select **Create**.
61+
62+
:::image type="content" source="media/apache-spark-environment-configuration/create-a-secret.png" alt-text="Screenshot showing the Azure secret creation screen.":::
63+
64+
> [!TIP]
65+
> [Azure CLI](../key-vault/secrets/quick-create-cli.md) and [Azure Key Vault secret client library for Python](../key-vault/secrets/quick-create-python.md#sign-in-to-azure) can also create Azure Key Vault secrets.
66+
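As the tip above notes, the secret can also be created programmatically. The following is a minimal sketch, assuming the `azure-identity` and `azure-keyvault-secrets` packages are installed and that the signed-in identity has permission to set secrets in the vault. The environment variable names (`KEY_VAULT_NAME`, `STORAGE_ACCESS_KEY`) and the secret name `storage-access-key` are hypothetical placeholders, not values from this article.

```python
"""Sketch: store a storage account access key as an Azure Key Vault secret."""


def vault_url(vault_name: str) -> str:
    # Azure public cloud Key Vault endpoints follow this pattern.
    return f"https://{vault_name}.vault.azure.net"


def store_secret(client, secret_name: str, secret_value: str) -> str:
    """Create (or update) a secret and return the name Key Vault reports back.

    `client` is expected to behave like azure.keyvault.secrets.SecretClient.
    """
    if not secret_name or not secret_value:
        raise ValueError("secret name and value must both be non-empty")
    stored = client.set_secret(secret_name, secret_value)
    return stored.name


def main() -> None:
    # Imported here so the helpers above work without the Azure SDK installed.
    import os

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(
        vault_url=vault_url(os.environ["KEY_VAULT_NAME"]),  # hypothetical variable
        credential=DefaultAzureCredential(),
    )
    # Read the access key from the environment instead of hard-coding it.
    name = store_secret(
        client,
        "storage-access-key",  # hypothetical secret name
        os.environ["STORAGE_ACCESS_KEY"],  # hypothetical variable
    )
    print(f"Stored secret: {name}")
```

Call `main()` after signing in (for example, with `az login`) so that `DefaultAzureCredential` can find a credential. Keeping the access key out of the source code avoids leaking it through version control.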
67+
## Add role assignments in Azure storage accounts
68+
69+
Before you start interactive data wrangling, ensure that the input and output data paths are accessible. First, assign the **Reader** and **Storage Blob Data Reader** roles to one of these identities:
70+
71+
- the user identity of the user logged in to the Notebooks session
72+
- a service principal
73+
74+
The **Reader** and **Storage Blob Data Reader** roles provide read-only access to the user identity or service principal. However, in certain scenarios, you might want to write the wrangled data back to the Azure storage account. To enable read and write access, assign the **Contributor** and **Storage Blob Data Contributor** roles to the user identity or service principal. To assign the appropriate roles to the user identity:
75+
76+
1. Open the [Microsoft Azure portal](https://portal.azure.com).
77+
1. Search for, and select, the **Storage accounts** service.
78+
79+
:::image type="content" source="media/apache-spark-environment-configuration/find-storage-accounts-service.png" lightbox="media/apache-spark-environment-configuration/find-storage-accounts-service.png" alt-text="Expandable screenshot showing Storage accounts service search and selection, in Microsoft Azure portal.":::
80+
81+
1. On the **Storage accounts** page, select the Azure Data Lake Storage (ADLS) Gen 2 storage account from the list. A page showing the storage account **Overview** opens.
82+
83+
:::image type="content" source="media/apache-spark-environment-configuration/storage-accounts-list.png" lightbox="media/apache-spark-environment-configuration/storage-accounts-list.png" alt-text="Expandable screenshot showing selection of the Azure Data Lake Storage (ADLS) Gen 2 storage account.":::
84+
85+
1. Select **Access Control (IAM)** from the left panel
86+
1. Select **Add role assignment**
87+
88+
:::image type="content" source="media/apache-spark-environment-configuration/storage-account-add-role-assignment.png" lightbox="media/apache-spark-environment-configuration/storage-account-add-role-assignment.png" alt-text="Screenshot showing the Azure storage account add role assignment option.":::
89+
90+
1. Find and select role **Storage Blob Data Contributor**
91+
1. Select **Next**
92+
93+
:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-choose-role.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-choose-role.png" alt-text="Screenshot showing the Azure add role assignment screen.":::
94+
95+
1. Select **User, group, or service principal**.
96+
1. Select **+ Select members**.
97+
1. Search for the user identity below **Select**
98+
1. Select the user identity from the list, so that it shows under **Selected members**
99+
1. Select the appropriate user identity
100+
1. Select **Next**
101+
102+
:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-choose-members.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-choose-members.png" alt-text="Screenshot showing the Azure add role assignment screen Members tab.":::
103+
104+
1. Select **Review + Assign**
105+
106+
:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-review-and-assign.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-review-and-assign.png" alt-text="Screenshot showing the Azure add role assignment screen review and assign tab.":::
107+
1. Repeat steps 2-13 for the **Contributor** role assignment.
108+
109+
Once the user identity has the appropriate roles assigned, data in the Azure storage account should become accessible.
110+
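The portal steps above can also be scripted. The following is a minimal sketch that builds `az role assignment create` commands for the roles this section names, assuming the Azure CLI is installed and signed in. The helper names and the example identifiers are hypothetical; only the role names come from this article.

```python
"""Sketch: build Azure CLI role assignment commands for a storage account."""
import subprocess

# Role names as listed in this article.
READ_ROLES = ("Reader", "Storage Blob Data Reader")
READ_WRITE_ROLES = ("Contributor", "Storage Blob Data Contributor")


def storage_account_scope(subscription_id: str, resource_group: str,
                          account_name: str) -> str:
    # Azure Resource Manager ID of the storage account, used as the scope.
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_commands(assignee: str, scope: str,
                             write_access: bool = False) -> list:
    """Return one `az role assignment create` argument list per role."""
    roles = READ_WRITE_ROLES if write_access else READ_ROLES
    return [
        ["az", "role", "assignment", "create",
         "--assignee", assignee, "--role", role, "--scope", scope]
        for role in roles
    ]


def assign_roles(assignee: str, scope: str, write_access: bool = False) -> None:
    # Runs each command in turn; raises CalledProcessError if one fails.
    for command in role_assignment_commands(assignee, scope, write_access):
        subprocess.run(command, check=True)
```

For example, `assign_roles("<user-or-service-principal-id>", storage_account_scope("<subscription-id>", "<resource-group>", "<storage-account>"), write_access=True)` would request the two read and write roles. Building the commands separately from running them makes it possible to review them before anything changes in Azure.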
111+
> [!NOTE]
112+
> If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace that has a managed virtual network associated with it, [a managed private endpoint to the storage account should be configured](../synapse-analytics/security/connect-to-a-secure-storage-account.md) to ensure data access.
113+
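With the role assignments in place, a notebook typically addresses data in an ADLS Gen 2 account through an `abfss://` URI. The following is a small sketch that builds such a URI. It is an illustration rather than part of the configuration steps above, and the account, container, and file names are hypothetical.

```python
"""Sketch: build an abfss:// URI for a file in an ADLS Gen 2 storage account."""


def abfss_uri(account_name: str, container: str, path: str = "") -> str:
    # Drop leading slashes so the URI has a single separator after the host.
    clean_path = path.lstrip("/")
    return f"abfss://{container}@{account_name}.dfs.core.windows.net/{clean_path}"


# Hypothetical usage inside a Spark session (requires pyspark and data access):
#   uri = abfss_uri("<storage-account>", "<container>", "data/titanic.csv")
#   df = spark.read.csv(uri, header=True)
```

If the read fails with an authorization error, checking the role assignments on the storage account is a reasonable first step.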
114+
## Next steps
115+
- [Apache Spark in Azure Machine Learning (preview)](./apache-spark-azure-ml-concepts.md)
116+
- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
117+
- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
118+
- [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
119+
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
120+
- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)
