
Commit 385136d

Merge pull request #226417 from fbsolo-ms1/updates-for-YP
Yogi P requested a new doc . . .
2 parents e0a58b2 + 0337e44 commit 385136d

9 files changed

+114 -0 lines changed
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
1+
---
2+
title: "Quickstart: Interactive Data Wrangling with Apache Spark (preview)"
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how to perform interactive data wrangling with Apache Spark in Azure Machine Learning
5+
author: ynpandey
6+
ms.author: franksolomon
7+
ms.reviewer: franksolomon
8+
ms.service: machine-learning
9+
ms.subservice: mldata
10+
ms.topic: quickstart
11+
ms.date: 02/06/2023
12+
#Customer intent: As a Full Stack ML Pro, I want to perform interactive data wrangling in Azure Machine Learning, with Apache Spark.
13+
---
14+
15+
# Quickstart: Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
16+
17+
[!INCLUDE [preview disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]
18+
19+
20+
The Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to the Apache Spark framework, which you can use for interactive data wrangling in Azure Machine Learning notebooks.
21+
22+
In this quickstart guide, you'll learn how to perform interactive data wrangling using Azure Machine Learning Managed (Automatic) Synapse Spark compute, Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.
23+
24+
## Prerequisites
25+
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
26+
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
27+
- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
28+
- To enable this feature:
29+
1. Navigate to the Azure Machine Learning studio UI
30+
2. In the icon section at the top right of the screen, select **Manage preview features** (megaphone icon)
31+
3. In the **Manage preview features** panel, toggle the **Run notebooks and jobs on managed Spark** feature to **on**
32+
:::image type="content" source="media/quickstart-spark-data-wrangling/how-to-enable-managed-spark-preview.png" lightbox="media/quickstart-spark-data-wrangling/how-to-enable-managed-spark-preview.png" alt-text="Screenshot showing the option to enable the Managed Spark preview.":::
33+
34+
## Add role assignments in Azure storage accounts
35+
36+
Before you start interactive data wrangling, ensure that the input and output data paths are accessible. To enable read and write access, assign the **Contributor** and **Storage Blob Data Contributor** roles to the user identity of the logged-in user.
37+
38+
To assign appropriate roles to the user identity:
39+
40+
1. In the Microsoft Azure portal, navigate to the Azure Data Lake Storage (ADLS) Gen 2 storage account page
41+
1. Select **Access Control (IAM)** from the left panel
42+
1. Select **Add role assignment**
43+
44+
:::image type="content" source="media/quickstart-spark-data-wrangling/storage-account-add-role-assignment.png" lightbox="media/quickstart-spark-data-wrangling/storage-account-add-role-assignment.png" alt-text="Screenshot showing the Access Control (IAM) screen with the Add role assignment option.":::
45+
46+
1. Find and select role **Storage Blob Data Contributor**
47+
1. Select **Next**
48+
49+
:::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-choose-role.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-choose-role.png" alt-text="Screenshot showing the Azure add role assignment screen.":::
50+
51+
1. Select **User, group, or service principal**.
52+
1. Select **+ Select members**.
53+
1. Search for the user identity below **Select**
54+
1. Select the user identity from the list, so that it shows under **Selected members**
55+
1. Select the appropriate user identity
56+
1. Select **Next**
57+
58+
:::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-choose-members.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-choose-members.png" alt-text="Screenshot showing the Azure add role assignment screen Members tab.":::
59+
60+
1. Select **Review + Assign**
61+
62+
:::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-review-and-assign.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-review-and-assign.png" alt-text="Screenshot showing the Azure add role assignment screen review and assign tab.":::
63+
1. Repeat steps 2-12 for the **Contributor** role assignment, selecting **Contributor** in place of **Storage Blob Data Contributor**.
64+
65+
Once the user identity has the appropriate roles assigned, data in the Azure storage account should become accessible.
66+
67+
## Managed (Automatic) Spark compute in Azure Machine Learning Notebooks
68+
69+
A Managed (Automatic) Spark compute is available in Azure Machine Learning Notebooks by default. To access it in a notebook, start in the **Compute** selection menu, and select **AzureML Spark Compute** under **Azure Machine Learning Spark**.
70+
71+
:::image type="content" source="media/quickstart-spark-data-wrangling/select-azure-ml-spark-compute.png" lightbox="media/quickstart-spark-data-wrangling/select-azure-ml-spark-compute.png" alt-text="Screenshot highlighting the selected Azure Machine Learning Spark option, located at the Compute selection menu.":::
72+
73+
## Interactive data wrangling with Titanic data
74+
75+
> [!TIP]
76+
> Data wrangling with a Managed (Automatic) Spark compute, combined with user identity passthrough for data access in an Azure Data Lake Storage (ADLS) Gen 2 storage account, requires the fewest configuration steps.
77+
78+
The data wrangling code shown here uses the `titanic.csv` file, available [here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account. This Python code snippet shows interactive data wrangling with an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and an input/output data URI, in format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`:
79+
80+
```python
import pyspark.pandas as pd

# Read the Titanic data into a pandas-on-Spark DataFrame
df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
# Replace missing values in Age column with the mean value
df["Age"] = df["Age"].fillna(df["Age"].mean())
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
# Write the wrangled data back to the storage account
df.to_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled",
    index_col="PassengerId",
)
```
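
The `abfss://` data URI format used in the snippet above can be assembled from its three parts. The following is a minimal sketch only: the helper name `build_abfss_uri` and the sample container and account names are hypothetical and are not part of any Azure SDK.

```python
def build_abfss_uri(file_system: str, storage_account: str, path: str) -> str:
    """Assemble an abfss:// URI for a path in an ADLS Gen 2 storage account.

    Hypothetical helper for illustration; it only formats a string.
    """
    if not file_system or not storage_account:
        raise ValueError("file_system and storage_account must be non-empty")
    # Strip leading slashes so the result never contains a double slash.
    clean_path = path.lstrip("/")
    return f"abfss://{file_system}@{storage_account}.dfs.core.windows.net/{clean_path}"


# Hypothetical container and storage account names.
input_uri = build_abfss_uri("my-container", "mystorageaccount", "data/titanic.csv")
output_uri = build_abfss_uri("my-container", "mystorageaccount", "/data/wrangled")
print(input_uri)
print(output_uri)
```

Building both URIs from the same inputs keeps the read and write paths in the same container, which is the layout the snippet above assumes.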
100+
101+
> [!NOTE]
102+
> `pyspark.pandas`, used in this Python code sample, requires Spark runtime version 3.2 or later.
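
Because `pyspark.pandas` was introduced in Apache Spark 3.2, a notebook can check the runtime version before importing it. This is a sketch under assumptions: the function name `supports_pandas_on_spark` is hypothetical, and it assumes the version string starts with numeric major and minor parts such as `3.2.1`.

```python
def supports_pandas_on_spark(spark_version: str) -> bool:
    """Return True if the given Spark version string is 3.2 or later.

    Hypothetical helper; only the major and minor parts are inspected.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 2)


print(supports_pandas_on_spark("3.1.2"))
print(supports_pandas_on_spark("3.2.1"))
```

In a notebook where a Spark session is available as `spark` (an assumption about the environment), the check could be called as `supports_pandas_on_spark(spark.version)`.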
103+
104+
:::image type="content" source="media/quickstart-spark-data-wrangling/managed-spark-interactive-data-wrangling.png" lightbox="media/quickstart-spark-data-wrangling/managed-spark-interactive-data-wrangling.png" alt-text="Screenshot showing use of a Managed (Automatic) Spark compute, for interactive data wrangling.":::
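
To see what the three wrangling steps described in the code comments do to the data, the sketch below applies the same logic to a few in-memory rows with plain Python, without Spark. The sample rows are hypothetical and only loosely modeled on the Titanic columns. It illustrates the logic and is not a replacement for the Spark code.

```python
from statistics import mean

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Age": None, "Cabin": "C85", "Embarked": "C"},
    {"PassengerId": 3, "Age": 26.0, "Cabin": None, "Embarked": None},
]

# Step 1: replace missing Age values with the mean of the known ages.
mean_age = mean(r["Age"] for r in rows if r["Age"] is not None)
for r in rows:
    if r["Age"] is None:
        r["Age"] = mean_age

# Step 2: fill missing Cabin values with the string "None".
for r in rows:
    if r["Cabin"] is None:
        r["Cabin"] = "None"

# Step 3: drop rows that still have any missing value.
wrangled = [r for r in rows if all(v is not None for v in r.values())]

for r in wrangled:
    print(r)
```

The order of the steps matters: because the Age and Cabin gaps are filled first, the final drop removes only rows with gaps in other columns.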
105+
106+
## Next steps
107+
- [Apache Spark in Azure Machine Learning (preview)](./apache-spark-azure-ml-concepts.md)
108+
- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
109+
- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
110+
- [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
111+
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
112+
- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -64,6 +64,8 @@
   items:
   - name: Create ML resources to get started
     href: quickstart-create-resources.md
+  - name: "Quickstart: Interactive Data Wrangling with Apache Spark (preview)"
+    href: quickstart-spark-data-wrangling.md
   - name: "Quickstart: Submit Apache Spark jobs in Azure Machine Learning (preview)"
     href: quickstart-spark-jobs.md
   - name: Run Jupyter notebooks
