
Commit 3a78b0a

Author: Jill Grant
Merge pull request #272083 from fbsolo-ms1/update-data-science-virtual-machine-files
Update data science virtual machine files
2 parents 043f36a + b93da12 commit 3a78b0a


articles/machine-learning/quickstart-spark-jobs.md

Lines changed: 57 additions & 58 deletions
@@ -1,43 +1,43 @@
 ---
 title: "Configure Apache Spark jobs in Azure Machine Learning"
 titleSuffix: Azure Machine Learning
-description: Learn how to submit Apache Spark jobs with Azure Machine Learning
+description: Learn how to submit Apache Spark jobs with Azure Machine Learning.
 author: ynpandey
 ms.author: yogipandey
 ms.reviewer: franksolomon
 ms.service: machine-learning
 ms.subservice: mldata
 ms.custom: build-2023, devx-track-python
 ms.topic: how-to
-ms.date: 05/22/2023
+ms.date: 04/12/2024
 #Customer intent: As a Full Stack ML Pro, I want to submit a Spark job in Azure Machine Learning.
 ---
 
 # Configure Apache Spark jobs in Azure Machine Learning
 
 [!INCLUDE [dev v2](includes/machine-learning-dev-v2.md)]
 
-The Azure Machine Learning integration, with Azure Synapse Analytics, provides easy access to distributed computing capability - backed by Azure Synapse - for scaling Apache Spark jobs on Azure Machine Learning.
+The Azure Machine Learning integration, with Azure Synapse Analytics, provides easy access to distributed computing capability - backed by Azure Synapse - to scale Apache Spark jobs on Azure Machine Learning.
 
 In this article, you learn how to submit a Spark job using Azure Machine Learning serverless Spark compute, Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough in a few simple steps.
 
-For more information about **Apache Spark in Azure Machine Learning** concepts, see [this resource](./apache-spark-azure-ml-concepts.md).
+For more information about **Apache Spark in Azure Machine Learning** concepts, visit [this resource](./apache-spark-azure-ml-concepts.md).
 
 ## Prerequisites
 
 # [CLI](#tab/cli)
 [!INCLUDE [cli v2](includes/machine-learning-cli-v2.md)]
 - An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
-- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
-- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
+- An Azure Machine Learning workspace. For more information, visit [Create workspace resources](./quickstart-create-resources.md).
+- An Azure Data Lake Storage (ADLS) Gen 2 storage account. For more information, visit [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
 - [Create an Azure Machine Learning compute instance](./concept-compute-instance.md#create).
 - [Install Azure Machine Learning CLI](./how-to-configure-cli.md?tabs=public).
 
 # [Python SDK](#tab/sdk)
 [!INCLUDE [sdk v2](includes/machine-learning-sdk-v2.md)]
 - An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
-- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
-- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
+- An Azure Machine Learning workspace. Visit [Create workspace resources](./quickstart-create-resources.md).
+- An Azure Data Lake Storage (ADLS) Gen 2 storage account. Visit [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
 - [Configure your development environment](./how-to-configure-environment.md), or [create an Azure Machine Learning compute instance](./concept-compute-instance.md#create).
 - [Install Azure Machine Learning SDK for Python](/python/api/overview/azure/ai-ml-readme).
 
@@ -50,7 +50,7 @@ For more information about **Apache Spark in Azure Machine Learning** concepts,
 
 ## Add role assignments in Azure storage accounts
 
-Before we submit an Apache Spark job, we must ensure that input, and output, data paths are accessible. Assign **Contributor** and **Storage Blob Data Contributor** roles to the user identity of the logged-in user to enable read and write access.
+Before we submit an Apache Spark job, we must ensure that the input and output data paths are accessible. Assign **Contributor** and **Storage Blob Data Contributor** roles to the user identity of the logged-in user, to enable read and write access.
 
 To assign appropriate roles to the user identity:
 
@@ -68,16 +68,16 @@ To assign appropriate roles to the user identity:
 
 :::image type="content" source="media/quickstart-spark-jobs/storage-account-add-role-assignment.png" lightbox="media/quickstart-spark-jobs/storage-account-add-role-assignment.png" alt-text="Expandable screenshot showing the Azure access keys screen.":::
 
-1. Search for the role **Storage Blob Data Contributor**.
-1. Select the role: **Storage Blob Data Contributor**.
+1. Search for the **Storage Blob Data Contributor** role.
+1. Select the **Storage Blob Data Contributor** role.
 1. Select **Next**.
 
 :::image type="content" source="media/quickstart-spark-jobs/add-role-assignment-choose-role.png" lightbox="media/quickstart-spark-jobs/add-role-assignment-choose-role.png" alt-text="Expandable screenshot showing the Azure add role assignment screen.":::
 
 1. Select **User, group, or service principal**.
 1. Select **+ Select members**.
 1. In the textbox under **Select**, search for the user identity.
-1. Select the user identity from the list so that it shows under **Selected members**.
+1. Select the user identity from the list, so that it shows under **Selected members**.
 1. Select the appropriate user identity.
 1. Select **Next**.
 
@@ -88,10 +88,10 @@ To assign appropriate roles to the user identity:
 :::image type="content" source="media/quickstart-spark-jobs/add-role-assignment-review-and-assign.png" lightbox="media/quickstart-spark-jobs/add-role-assignment-review-and-assign.png" alt-text="Expandable screenshot showing the Azure add role assignment screen review and assign tab.":::
 1. Repeat steps 2-13 for **Storage Blob Contributor** role assignment.
 
-Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become accessible once the user identity has appropriate roles assigned.
+Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become accessible once the user identity has the appropriate roles assigned.
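With those role assignments in place, a quick optional sanity check is to list a few paths in the container with the Azure SDK for Python. This is an illustrative sketch rather than a step from the article, and the account and container names are placeholders:

```python
# Optional access check (sketch): list paths in the ADLS Gen 2 container
# with the logged-in user identity. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "<STORAGE_ACCOUNT_NAME>"
service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("<FILE_SYSTEM_NAME>")
for path in file_system.get_paths():
    print(path.name)  # an authorization error here suggests a missing role assignment
```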

 ## Create parametrized Python code
-A Spark job requires a Python script that takes arguments, which can be developed by modifying the Python code developed from [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md). A sample Python script is shown here.
+A Spark job requires a Python script that accepts arguments. To build this script, you can modify the Python code developed from [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md). A sample Python script is shown here:
 
 ```python
 # titanic.py
@@ -120,8 +120,8 @@ df.to_csv(args.wrangled_data, index_col="PassengerId")
 ```
 
 > [!NOTE]
-> - This Python code sample uses `pyspark.pandas`, which is only supported by Spark runtime version 3.2.
-> - Please ensure that `titanic.py` file is uploaded to a folder named `src`. The `src` folder should be located in the same directory where you have created the Python script/notebook or the YAML specification file defining the standalone Spark job.
+> - This Python code sample uses `pyspark.pandas`, which only Spark runtime version 3.2 supports.
+> - Please ensure that the `titanic.py` file is uploaded to a folder named `src`. The `src` folder should be located in the same directory where you have created the Python script/notebook or the YAML specification file that defines the standalone Spark job.
 
 That script takes two arguments: `--titanic_data` and `--wrangled_data`. These arguments pass the input data path, and the output folder, respectively. The script uses the `titanic.csv` file, [available here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
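Since the sample script is shown only in part in this diff, the following sketch outlines the shape such a parameterized script can take. It assumes `pyspark.pandas` on Spark runtime 3.2 and the standard Titanic column names; the wrangling step itself is a placeholder, not the article's exact code:

```python
# titanic.py (sketch, not the article's exact file)
import argparse

import pyspark.pandas as pd

# Parse the two arguments that the Spark job passes in.
parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data", help="input CSV path, e.g. an abfss:// URI")
parser.add_argument("--wrangled_data", help="output folder, e.g. an abfss:// URI")
args = parser.parse_args()

# Read the input data, apply an illustrative wrangling step, and write the result.
df = pd.read_csv(args.titanic_data, index_col="PassengerId")
df = df.dropna()  # placeholder for real wrangling logic
df.to_csv(args.wrangled_data, index_col="PassengerId")
```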

@@ -132,8 +132,8 @@ That script takes two arguments: `--titanic_data` and `--wrangled_data`. These a
 
 > [!TIP]
 > You can submit a Spark job from:
-> - [terminal of an Azure Machine Learning compute instance](./how-to-access-terminal.md#access-a-terminal).
-> - terminal of [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
+> - the [terminal of an Azure Machine Learning compute instance](./how-to-access-terminal.md#access-a-terminal).
+> - the terminal of [Visual Studio Code, connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
 > - your local computer that has [the Azure Machine Learning CLI](./how-to-configure-cli.md?tabs=public) installed.
 
 This example YAML specification shows a standalone Spark job. It uses an Azure Machine Learning serverless Spark compute, user identity passthrough, and input/output data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
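To make that URI format concrete, this small illustration (with hypothetical values) shows how the placeholders map onto an `abfss://` URI:

```python
# Hypothetical values, for illustration of the abfss:// URI format only.
file_system_name = "mycontainer"        # <FILE_SYSTEM_NAME>: matches the container name
storage_account_name = "mystorageacct"  # <STORAGE_ACCOUNT_NAME>
path_to_data = "data/titanic.csv"       # <PATH_TO_DATA>

uri = (
    f"abfss://{file_system_name}@{storage_account_name}"
    f".dfs.core.windows.net/{path_to_data}"
)
print(uri)  # abfss://mycontainer@mystorageacct.dfs.core.windows.net/data/titanic.csv
```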
@@ -178,8 +178,8 @@ resources:
 ```
 
 In the above YAML specification file:
-- `code` property defines relative path of the folder containing parameterized `titanic.py` file.
-- `resource` property defines `instance_type` and Apache Spark `runtime_version` used by serverless Spark compute. The following instance types are currently supported:
+- the `code` property defines the relative path of the folder that contains the parameterized `titanic.py` file.
+- the `resource` property defines the `instance_type` and the Apache Spark `runtime_version` values that serverless Spark compute uses. These instance type values are currently supported:
   - `standard_e4s_v3`
   - `standard_e8s_v3`
   - `standard_e16s_v3`
@@ -198,10 +198,10 @@ az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBS
 > [!TIP]
 > You can submit a Spark job from:
 > - an Azure Machine Learning Notebook connected to an Azure Machine Learning compute instance.
-> - [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
+> - [Visual Studio Code, connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
 > - your local computer that has [the Azure Machine Learning SDK for Python](/python/api/overview/azure/ai-ml-readme) installed.
 
-This Python code snippet shows a standalone Spark job creation, with an Azure Machine Learning serverless Spark compute, user identity passthrough, and input/output data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`format. Here, the `<FILE_SYSTEM_NAME>` matches the container name.
+This Python code snippet shows a standalone Spark job creation. It uses an Azure Machine Learning serverless Spark compute, user identity passthrough, and input/output data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, the `<FILE_SYSTEM_NAME>` matches the container name.
 
 ```python
 from azure.ai.ml import MLClient, spark, Input, Output
@@ -253,8 +253,8 @@ ml_client.jobs.stream(returned_spark_job.name)
 ```
 
 In the above code sample:
-- `code` parameter defines relative path of the folder containing parameterized `titanic.py` file.
-- `resource` parameter defines `instance_type` and Apache Spark `runtime_version` used by serverless Spark compute (preview). The following instance types are currently supported:
+- the `code` parameter defines the relative path of the folder that contains the parameterized `titanic.py` file.
+- the `resource` parameter defines the `instance_type` and the Apache Spark `runtime_version` values that the serverless Spark compute (preview) uses. These instance type values are currently supported:
   - `Standard_E4S_V3`
   - `Standard_E8S_V3`
   - `Standard_E16S_V3`
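Because the SDK snippet appears here only in part, the following is a hedged, end-to-end sketch of what such a submission can look like with the `azure-ai-ml` package. The workspace identifiers, URIs, and display name are placeholders, and the exact keyword values (for example, `runtime_version` and the `identity` setting) should be checked against the installed SDK version:

```python
# Sketch of a standalone Spark job submission with the azure-ai-ml SDK (v2).
# All identifiers, paths, and names below are placeholders.
from azure.ai.ml import Input, MLClient, Output, spark
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

spark_job = spark(
    display_name="titanic-spark-job-sketch",
    code="./src",                   # folder that contains the parameterized titanic.py
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={                     # serverless Spark compute
        "instance_type": "Standard_E8S_V3",
        "runtime_version": "3.2",
    },
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
            mode="direct",
        ),
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/",
            mode="direct",
        ),
    },
    identity={"type": "user_identity"},  # user identity passthrough for data access
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
)

returned_spark_job = ml_client.jobs.create_or_update(spark_job)
ml_client.jobs.stream(returned_spark_job.name)  # stream job logs, as in the truncated snippet
```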
@@ -265,57 +265,56 @@ In the above code sample:
 
 [!INCLUDE [machine-learning-preview-generic-disclaimer](includes/machine-learning-preview-generic-disclaimer.md)]
 
-First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for workspace default datastore `workspaceblobstore`. To submit a standalone Spark job using the Azure Machine Learning studio UI:
+First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for the workspace default `workspaceblobstore` datastore. To submit a standalone Spark job using the Azure Machine Learning studio UI:
 
 1. Select **+ New**, located near the top right side of the screen.
-2. Select **Spark job (preview)**.
-3. On the **Compute** screen:
+1. Select **Spark job (preview)**.
+1. On the **Compute** screen:
 
    1. Under **Select compute type**, select **Spark serverless** for serverless Spark compute.
-   2. Select **Virtual machine size**. The following instance types are currently supported:
+   1. Select **Virtual machine size**. These instance types are currently supported:
       - `Standard_E4s_v3`
      - `Standard_E8s_v3`
      - `Standard_E16s_v3`
      - `Standard_E32s_v3`
      - `Standard_E64s_v3`
-   3. Select **Spark runtime version** as **Spark 3.2**.
-   4. Select **Next**.
-4. On the **Environment** screen, select **Next**.
-5. On **Job settings** screen:
+   1. Select **Spark runtime version** as **Spark 3.2**.
+   1. Select **Next**.
+1. On the **Environment** screen, select **Next**.
+1. On the **Job settings** screen:
    1. Provide a job **Name**, or use the job **Name**, which is generated by default.
-   2. Select an **Experiment name** from the dropdown menu.
-   3. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
-   4. Under the **Code** section:
+   1. Select an **Experiment name** from the dropdown menu.
+   1. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
+   1. Under the **Code** section:
       1. Select **Azure Machine Learning workspace default blob storage** from **Choose code location** dropdown.
-      2. Under **Path to code file to upload**, select **Browse**.
-      3. In the pop-up screen titled **Path selection**, select the path of code file `titanic.py` on the workspace default datastore `workspaceblobstore`.
-      4. Select **Save**.
-      5. Input `titanic.py` as the name of **Entry file** for the standalone job.
-      6. To add an input, select **+ Add input** under **Inputs** and
+      1. Under **Path to code file to upload**, select **Browse**.
+      1. In the pop-up screen titled **Path selection**, select the path of the `titanic.py` code file on the workspace `workspaceblobstore` default datastore.
+      1. Select **Save**.
+      1. Input `titanic.py` as the name of the **Entry file** for the standalone job.
+      1. To add an input, select **+ Add input** under **Inputs** and
          1. Enter **Input name** as `titanic_data`. The input should refer to this name later in the **Arguments**.
-         2. Select **Input type** as **Data**.
-         3. Select **Data type** as **File**.
-         4. Select **Data source** as **URI**.
-         5. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for `titanic.csv` file in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
-      7. To add an input, select **+ Add output** under **Outputs** and
+         1. Select **Input type** as **Data**.
+         1. Select **Data type** as **File**.
+         1. Select **Data source** as **URI**.
+         1. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for `titanic.csv` file in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
+      1. To add an output, select **+ Add output** under **Outputs** and
          1. Enter **Output name** as `wrangled_data`. The output should refer to this name later in the **Arguments**.
-         2. Select **Output type** as **Folder**.
-         3. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here `<FILE_SYSTEM_NAME>` matches the container name.
-      8. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
-   5. Under the **Spark configurations** section:
+         1. Select **Output type** as **Folder**.
+         1. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
+      1. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
+   1. Under the **Spark configurations** section:
       1. For **Executor size**:
          1. Enter the number of executor **Cores** as 2 and executor **Memory (GB)** as 2.
-         2. For **Dynamically allocated executors**, select **Disabled**.
-         3. Enter the number of **Executor instances** as 2.
-      2. For **Driver size**, enter number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
-   6. Select **Next**.
-6. On the **Review** screen:
+         1. For **Dynamically allocated executors**, select **Disabled**.
+         1. Enter the number of **Executor instances** as 2.
+      1. For **Driver size**, enter the number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
+   1. Select **Next**.
+1. On the **Review** screen:
    1. Review the job specification before submitting it.
-   2. Select **Create** to submit the standalone Spark job.
+   1. Select **Create** to submit the standalone Spark job.
 
 > [!NOTE]
-> A standalone job submitted from the Studio UI using an Azure Machine Learning serverless Spark compute defaults to user identity passthrough for data access.
-
+> A standalone job submitted from the Studio UI, using an Azure Machine Learning serverless Spark compute, defaults to the user identity passthrough for data access.
 
 ---
 
@@ -329,4 +328,4 @@ First, upload the parameterized Python code `titanic.py` to the Azure Blob stora
 - [Interactive Data Wrangling with Apache Spark in Azure Machine Learning](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
 - [Submit Spark jobs in Azure Machine Learning](./how-to-submit-spark-jobs.md)
 - [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
-- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)
+- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)
