---
title: "Quickstart: Interactive Data Wrangling with Apache Spark (preview)"
titleSuffix: Azure Machine Learning
description: Learn how to perform interactive data wrangling with Apache Spark in Azure Machine Learning
author: ynpandey
ms.author: franksolomon
ms.reviewer: franksolomon
ms.service: machine-learning
ms.subservice: mldata
ms.topic: quickstart
ms.date: 02/06/2023
#Customer intent: As a Full Stack ML Pro, I want to perform interactive data wrangling in Azure Machine Learning, with Apache Spark.
---

# Quickstart: Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)

[!INCLUDE [preview disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]

Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to the Apache Spark framework, enabling interactive data wrangling in Azure Machine Learning notebooks.

In this quickstart guide, you'll learn how to perform interactive data wrangling with an Azure Machine Learning Managed (Automatic) Synapse Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.

## Prerequisites
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
- To enable this feature:
  1. Navigate to the Azure Machine Learning studio UI
  2. In the icon section at the top right of the screen, select **Manage preview features** (megaphone icon)
  3. In the **Managed preview feature** panel, toggle the **Run notebooks and jobs on managed Spark** feature to **on**
     :::image type="content" source="media/quickstart-spark-data-wrangling/how-to-enable-managed-spark-preview.png" lightbox="media/quickstart-spark-data-wrangling/how-to-enable-managed-spark-preview.png" alt-text="Screenshot showing the option to enable the Managed Spark preview.":::

## Add role assignments in Azure storage accounts

Before we start interactive data wrangling, we must ensure that the input and output data paths are accessible. To enable read and write access, assign the **Contributor** and **Storage Blob Data Contributor** roles to the user identity of the logged-in user.

To assign appropriate roles to the user identity:

1. In the Microsoft Azure portal, navigate to the Azure Data Lake Storage (ADLS) Gen 2 storage account page
1. Select **Access Control (IAM)** from the left panel
1. Select **Add role assignment**

    :::image type="content" source="media/quickstart-spark-data-wrangling/storage-account-add-role-assignment.png" lightbox="media/quickstart-spark-data-wrangling/storage-account-add-role-assignment.png" alt-text="Screenshot showing the Azure add role assignment option on the Access Control (IAM) page.":::

1. Find and select role **Storage Blob Data Contributor**
1. Select **Next**

    :::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-choose-role.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-choose-role.png" alt-text="Screenshot showing the Azure add role assignment screen.":::

1. Select **User, group, or service principal**
1. Select **+ Select members**
1. Search for the user identity below **Select**
1. Select the user identity from the list, so that it shows under **Selected members**
1. Select the appropriate user identity
1. Select **Next**

    :::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-choose-members.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-choose-members.png" alt-text="Screenshot showing the Azure add role assignment screen Members tab.":::

1. Select **Review + Assign**

    :::image type="content" source="media/quickstart-spark-data-wrangling/add-role-assignment-review-and-assign.png" lightbox="media/quickstart-spark-data-wrangling/add-role-assignment-review-and-assign.png" alt-text="Screenshot showing the Azure add role assignment screen review and assign tab.":::
1. Repeat steps 2-12 for the **Contributor** role assignment.

Once the user identity has the appropriate roles assigned, data in the Azure storage account should become accessible.
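
If you'd rather script these role assignments than step through the portal, the sketch below shows one way to create the same two assignments with the Azure SDK for Python. It isn't a required part of this quickstart; it assumes recent `azure-identity` and `azure-mgmt-authorization` packages, and every placeholder value (subscription, resource group, storage account, user object ID) is hypothetical.

```python
# Hypothetical sketch: create the two role assignments programmatically.
# Assumes: pip install azure-identity azure-mgmt-authorization (recent versions)
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<SUBSCRIPTION_ID>"       # placeholder
resource_group = "<RESOURCE_GROUP>"         # placeholder
storage_account = "<STORAGE_ACCOUNT_NAME>"  # placeholder
user_object_id = "<USER_OBJECT_ID>"         # Azure AD object ID of the logged-in user

# Built-in Azure role definition IDs for the two roles used in this quickstart
roles = {
    "Storage Blob Data Contributor": "ba92f5b4-2d11-453d-a403-e96b0029c9fe",
    "Contributor": "b24988ac-6180-42a0-ab88-20f7382dd24c",
}

# Scope the assignments to the ADLS Gen 2 storage account
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
for role_name, role_definition_id in roles.items():
    client.role_assignments.create(
        scope=scope,
        role_assignment_name=str(uuid.uuid4()),  # each role assignment needs a new GUID name
        parameters=RoleAssignmentCreateParameters(
            role_definition_id=(
                f"/subscriptions/{subscription_id}/providers"
                f"/Microsoft.Authorization/roleDefinitions/{role_definition_id}"
            ),
            principal_id=user_object_id,
            principal_type="User",
        ),
    )
    print(f"Assigned {role_name}")
```

However you assign the roles, allow a few minutes for the assignments to propagate before you start wrangling data.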

## Managed (Automatic) Spark compute in Azure Machine Learning Notebooks

A Managed (Automatic) Spark compute is available in Azure Machine Learning Notebooks by default. To access it in a notebook, start in the **Compute** selection menu, and select **AzureML Spark Compute** under **Azure Machine Learning Spark**.

:::image type="content" source="media/quickstart-spark-data-wrangling/select-azure-ml-spark-compute.png" lightbox="media/quickstart-spark-data-wrangling/select-azure-ml-spark-compute.png" alt-text="Screenshot highlighting the selected Azure Machine Learning Spark option, located at the Compute selection menu.":::
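
Once the session is running, one quick way to confirm the attached runtime is to query the `SparkSession` object, which Spark notebook sessions typically expose as a predefined `spark` variable. A minimal check, assuming that convention holds for your session:

```python
# Minimal sanity check of the attached Spark session.
# Assumes the notebook pre-defines the `spark` SparkSession variable,
# as Spark notebook environments typically do.
print(spark.version)                          # expect a 3.2.x runtime
print(spark.sparkContext.defaultParallelism)  # cores available to the session
```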

## Interactive data wrangling with Titanic data

> [!TIP]
> Data wrangling with a Managed (Automatic) Spark compute, combined with user identity passthrough for data access in an Azure Data Lake Storage (ADLS) Gen 2 storage account, requires the fewest configuration steps.

The data wrangling code shown here uses the `titanic.csv` file, available [here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account. This Python code snippet shows interactive data wrangling with an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and an input/output data URI in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`:

```python
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCols=["Age"]).setStrategy(
    "mean"
)  # Imputer that replaces missing values in the Age column with the mean value
df = (
    imputer.fit(df.to_spark(index_col="PassengerId"))
    .transform(df.to_spark(index_col="PassengerId"))
    .to_pandas_on_spark(index_col="PassengerId")
)  # Fit and apply the imputer, then convert back to a pandas-on-Spark DataFrame
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled",
    index_col="PassengerId",
)
```

> [!NOTE]
> Only Spark runtime version 3.2 supports `pyspark.pandas`, which is used in this Python code sample.

:::image type="content" source="media/quickstart-spark-data-wrangling/managed-spark-interactive-data-wrangling.png" lightbox="media/quickstart-spark-data-wrangling/managed-spark-interactive-data-wrangling.png" alt-text="Screenshot showing use of a Managed (Automatic) Spark compute, for interactive data wrangling.":::

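As a quick check, you can read the wrangled output back into a `pyspark.pandas` DataFrame. This sketch assumes the same placeholder file system and storage account values used above, and that `df.to_csv()` wrote its CSV part files under the `data/wrangled` folder:

```python
import pyspark.pandas as pd

# Read the CSV files that df.to_csv() wrote to the output folder
wrangled_df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled",
    index_col="PassengerId",
)
print(wrangled_df.head())  # preview the first few wrangled rows
```
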
## Next steps
- [Apache Spark in Azure Machine Learning (preview)](./apache-spark-azure-ml-concepts.md)
- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
- [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)