The Azure Machine Learning integration with Azure Synapse Analytics provides easy access to distributed computing capability, backed by Azure Synapse, to scale Apache Spark jobs on Azure Machine Learning.

In this article, you learn how to submit a Spark job using Azure Machine Learning serverless Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough in a few simple steps.

For more information about **Apache Spark in Azure Machine Learning** concepts, visit [this resource](./apache-spark-azure-ml-concepts.md).
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
- An Azure Machine Learning workspace. For more information, visit [Create workspace resources](./quickstart-create-resources.md).
- An Azure Data Lake Storage (ADLS) Gen 2 storage account. For more information, visit [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
- [Configure your development environment](./how-to-configure-environment.md), or [create an Azure Machine Learning compute instance](./concept-compute-instance.md#create).
- [Install Azure Machine Learning SDK for Python](/python/api/overview/azure/ai-ml-readme).

## Add role assignments in Azure storage accounts

Before we submit an Apache Spark job, we must ensure that the input and output data paths are accessible. Assign **Contributor** and **Storage Blob Data Contributor** roles to the user identity of the logged-in user to enable read and write access.

To assign appropriate roles to the user identity:

1. Search for the **Storage Blob Data Contributor** role.
1. Select the **Storage Blob Data Contributor** role.
1. Select **Next**.

    :::image type="content" source="media/quickstart-spark-jobs/add-role-assignment-choose-role.png" lightbox="media/quickstart-spark-jobs/add-role-assignment-choose-role.png" alt-text="Expandable screenshot showing the Azure add role assignment screen.":::

1. Select **User, group, or service principal**.
1. Select **+ Select members**.
1. In the textbox under **Select**, search for the user identity.
1. Select the user identity from the list, so that it shows under **Selected members**.
1. Select the appropriate user identity.
1. Select **Next**.

    :::image type="content" source="media/quickstart-spark-jobs/add-role-assignment-review-and-assign.png" lightbox="media/quickstart-spark-jobs/add-role-assignment-review-and-assign.png" alt-text="Expandable screenshot showing the Azure add role assignment screen review and assign tab.":::

1. Repeat steps 2-13 for the **Contributor** role assignment.

Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become accessible once the user identity has the appropriate roles assigned.
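
If you prefer to script these role assignments instead of using the Azure portal, a minimal Azure CLI sketch, with placeholder identity, subscription, resource group, and storage account values, might look like this:

```azurecli
# Illustrative sketch only: replace every <PLACEHOLDER> value before running.
az role assignment create \
    --role "Contributor" \
    --assignee "<USER_PRINCIPAL_NAME>" \
    --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"

az role assignment create \
    --role "Storage Blob Data Contributor" \
    --assignee "<USER_PRINCIPAL_NAME>" \
    --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"
```
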
## Create parameterized Python code

A Spark job requires a Python script that accepts arguments. To build this script, you can modify the Python code developed from [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md). A sample Python script is shown here:
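
The following is a minimal sketch of such a parameterized script, built around `pyspark.pandas` and the `--titanic_data` and `--wrangled_data` arguments described below; the wrangling steps shown are illustrative, not the exact published sample:

```python
# titanic.py: an illustrative parameterized wrangling script.
import argparse

import pyspark.pandas as pd

# Parse the two arguments that the Spark job passes in.
parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data", help="input CSV file URI")
parser.add_argument("--wrangled_data", help="output folder URI")
args = parser.parse_args()

# Read the input CSV into a pandas-on-Spark DataFrame.
df = pd.read_csv(args.titanic_data, index_col="PassengerId")

# Example wrangling: fill missing ages with the mean and drop sparse columns.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df = df.drop(columns=["Cabin", "Ticket"])

# Write the wrangled data to the output folder as CSV.
df.to_csv(args.wrangled_data, index_col="PassengerId")
```
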
> [!NOTE]
> - This Python code sample uses `pyspark.pandas`, which only Spark runtime version 3.2 supports.
> - Ensure that the `titanic.py` file is uploaded to a folder named `src`. The `src` folder should be located in the same directory where you have created the Python script/notebook or the YAML specification file that defines the standalone Spark job.

That script takes two arguments: `--titanic_data` and `--wrangled_data`. These arguments pass the input data path and the output folder, respectively. The script uses the `titanic.csv` file, [available here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
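
For example, one way to upload the file with the Azure CLI, sketched with placeholder account and container names:

```azurecli
# Illustrative sketch: upload titanic.csv to a folder in the ADLS Gen 2 container.
az storage fs file upload \
    --account-name "<STORAGE_ACCOUNT_NAME>" \
    --file-system "<FILE_SYSTEM_NAME>" \
    --path "data/titanic.csv" \
    --source "./titanic.csv" \
    --auth-mode login
```
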
> [!TIP]
> You can submit a Spark job from:
> - the [terminal of an Azure Machine Learning compute instance](./how-to-access-terminal.md#access-a-terminal).
> - the terminal of [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
> - your local computer that has [the Azure Machine Learning CLI](./how-to-configure-cli.md?tabs=public) installed.

This example YAML specification shows a standalone Spark job. It uses an Azure Machine Learning serverless Spark compute, user identity passthrough, and an input/output data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
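
A minimal sketch of such a YAML specification, with placeholder storage names and illustrative resource and Spark configuration values, might look like this:

```yaml
# Illustrative sketch of a standalone Spark job specification; adjust paths,
# placeholder names, and resource sizes for your workspace.
type: spark

code: ./src
entry:
  file: titanic.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2

inputs:
  titanic_data:
    type: uri_file
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

identity:
  type: user_identity

resources:
  instance_type: standard_e8s_v3
  runtime_version: "3.2"
```
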
In the above YAML specification file:

- the `code` property defines the relative path of the folder that contains the parameterized `titanic.py` file.
- the `resource` property defines the `instance_type` and the Apache Spark `runtime_version` values that serverless Spark compute uses. These instance type values are currently supported:
  - `standard_e4s_v3`
  - `standard_e8s_v3`
  - `standard_e16s_v3`
  - `standard_e32s_v3`
  - `standard_e64s_v3`

> [!TIP]
> You can submit a Spark job from:
> - an Azure Machine Learning Notebook connected to an Azure Machine Learning compute instance.
> - [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
> - your local computer that has [the Azure Machine Learning SDK for Python](/python/api/overview/azure/ai-ml-readme) installed.

This Python code snippet shows a standalone Spark job creation. It uses an Azure Machine Learning serverless Spark compute, user identity passthrough, and an input/output data URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.

```python
from azure.ai.ml import MLClient, spark, Input, Output
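# NOTE: the remainder of this snippet is a minimal illustrative sketch, not the
# exact published sample; replace the placeholder values (subscription, resource
# group, workspace, and storage container) with your own before running it.
from azure.ai.ml.entities import UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the Azure Machine Learning workspace.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# Define a standalone Spark job that runs the parameterized titanic.py script.
spark_job = spark(
    display_name="Titanic-Spark-Job-SDK",
    code="./src",
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "standard_e8s_v3", "runtime_version": "3.2"},
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
            mode="direct",
        )
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/",
            mode="direct",
        )
    },
    identity=UserIdentityConfiguration(),
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
)

# Submit the standalone Spark job.
returned_spark_job = ml_client.jobs.create_or_update(spark_job)
```
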
In the above code sample:

- the `code` parameter defines the relative path of the folder that contains the parameterized `titanic.py` file.
- the `resource` parameter defines the `instance_type` and the Apache Spark `runtime_version` values that the serverless Spark compute (preview) uses. These instance type values are currently supported:
  - `standard_e4s_v3`
  - `standard_e8s_v3`
  - `standard_e16s_v3`
  - `standard_e32s_v3`
  - `standard_e64s_v3`

First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for the workspace default `workspaceblobstore` datastore. To submit a standalone Spark job using the Azure Machine Learning studio UI:

1. Select **+ New**, located near the top right side of the screen.
1. Select **Spark job (preview)**.
1. On the **Compute** screen:
    1. Under **Select compute type**, select **Spark serverless** for serverless Spark compute.
    1. Select **Virtual machine size**. These instance types are currently supported:
        - `Standard_E4s_v3`
        - `Standard_E8s_v3`
        - `Standard_E16s_v3`
        - `Standard_E32s_v3`
        - `Standard_E64s_v3`
    1. Select **Spark runtime version** as **Spark 3.2**.
    1. Select **Next**.
1. On the **Environment** screen, select **Next**.
1. On the **Job settings** screen:
    1. Provide a job **Name**, or use the job **Name** generated by default.
    1. Select an **Experiment name** from the dropdown menu.
    1. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
    1. Under the **Code** section:
        1. Under **Path to code file to upload**, select **Browse**.
        1. In the pop-up screen titled **Path selection**, select the path of the `titanic.py` code file on the workspace default datastore `workspaceblobstore`.
        1. Select **Save**.
        1. Input `titanic.py` as the name of the **Entry file** for the standalone job.
        1. To add an input, select **+ Add input** under **Inputs** and
            1. Enter **Input name** as `titanic_data`. The **Arguments** refer to this input name later.
            1. Select **Input type** as **Data**.
            1. Select **Data type** as **File**.
            1. Select **Data source** as **URI**.
            1. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for the `titanic.csv` file in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
        1. To add an output, select **+ Add output** under **Outputs** and
            1. Enter **Output name** as `wrangled_data`. The **Arguments** refer to this output name later.
            1. Select **Output type** as **Folder**.
            1. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
        1. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
    1. Under the **Spark configurations** section:
        1. For **Executor size**:
            1. Enter the number of executor **Cores** as 2 and executor **Memory (GB)** as 2.
            1. For **Dynamically allocated executors**, select **Disabled**.
            1. Enter the number of **Executor instances** as 2.
        1. For **Driver size**, enter the number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
    1. Select **Next**.
1. On the **Review** screen:
    1. Review the job specification before submitting it.
    1. Select **Create** to submit the standalone Spark job.

> [!NOTE]
> A standalone job submitted from the Studio UI, using an Azure Machine Learning serverless Spark compute, defaults to user identity passthrough for data access.

---

- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
- [Submit Spark jobs in Azure Machine Learning](./how-to-submit-spark-jobs.md)
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)