**articles/machine-learning/apache-spark-azure-ml-concepts.md** (1 addition, 1 deletion)
@@ -107,7 +107,7 @@ To access data and other resources, a Spark job can use either a user identity p
|Compute|Supported identities|Default identity|
|---|---|---|
|Managed (Automatic) Spark compute|User identity and managed identity|User identity|
|Attached Synapse Spark pool|User identity and managed identity|Managed identity - compute identity of the attached Synapse Spark pool|
-[This article](./how-to-submit-spark-jobs.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the Managed (Automatic) Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).
+[This article](./apache-spark-environment-configuration.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the Managed (Automatic) Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).

> [!NOTE]
> - To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account used for data input and output) to the identity that will be used for the Spark job submission.
**articles/machine-learning/apache-spark-environment-configuration.md** (12 additions, 0 deletions)
@@ -111,7 +111,19 @@ Once the user identity has the appropriate roles assigned, data in the Azure sto
> [!NOTE]
> If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace that has a managed virtual network associated with it, [a managed private endpoint to storage account should be configured](../synapse-analytics/security/connect-to-a-secure-storage-account.md) to ensure data access.
+## Ensuring resource access for Spark jobs
+
+Spark jobs can use either a managed identity or user identity passthrough to access data and other resources. The following table summarizes the different mechanisms for resource access while using Azure Machine Learning Managed (Automatic) Spark compute and attached Synapse Spark pool.
+
+|Compute|Supported identities|Default identity|
+|---|---|---|
+|Managed (Automatic) Spark compute|User identity and managed identity|User identity|
+|Attached Synapse Spark pool|User identity and managed identity|Managed identity - compute identity of the attached Synapse Spark pool|
+
+If the CLI or SDK code defines an option to use managed identity, Azure Machine Learning Managed (Automatic) Spark compute relies on a user-assigned managed identity attached to the workspace. You can attach a user-assigned managed identity to an existing Azure Machine Learning workspace using Azure Machine Learning CLI v2, or with `ARMClient`.
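Both attachment routes (CLI v2 YAML or `ARMClient` JSON) revolve around the ARM resource ID of the user-assigned managed identity. As a hedged sketch, with placeholder names and a payload shape inferred from this description rather than quoted from the ARM API, the pieces can be modeled like this:

```python
# Illustrative sketch only: models the payload used when attaching a
# user-assigned managed identity to a workspace. The subscription ID,
# resource group, and identity name are hypothetical placeholders.

def uami_resource_id(subscription_id: str, resource_group: str, identity_name: str) -> str:
    """Build the ARM resource ID of a user-assigned managed identity."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.ManagedIdentity"
        f"/userAssignedIdentities/{identity_name}"
    )

def workspace_identity_payload(uami_id: str) -> dict:
    """Model a workspace-update body that attaches the given identity.

    The exact schema is defined by the Azure ML CLI v2 / ARM API; this dict
    only mirrors the shape described in the article (an `identity` section
    listing user-assigned identities).
    """
    return {
        "identity": {
            "type": "SystemAssigned,UserAssigned",
            "userAssignedIdentities": {uami_id: {}},
        }
    }

example_id = uami_resource_id("00000000-0000-0000-0000-000000000000", "my-rg", "my-uami")
payload = workspace_identity_payload(example_id)
```

The same dictionary shape can be serialized to YAML for the CLI v2 route or to JSON for the `ARMClient` route.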
## Next steps
- [Apache Spark in Azure Machine Learning (preview)](./apache-spark-azure-ml-concepts.md)
- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
**articles/machine-learning/how-to-submit-spark-jobs.md**

-Azure Machine Learning supports submission of standalone machine learning jobs, and creation of [machine learning pipelines](./concept-ml-pipelines.md), that involve multiple machine learning workflow steps. Azure Machine Learning handles both standalone Spark job creation, and creation of reusable Spark components that Azure Machine Learning pipelines can use. In this article, you'll learn how to submit Spark jobs using:
-- Azure Machine Learning Studio UI
+Azure Machine Learning supports submission of standalone machine learning jobs and creation of [machine learning pipelines](./concept-ml-pipelines.md) that involve multiple machine learning workflow steps. Azure Machine Learning handles both standalone Spark job creation, and creation of reusable Spark components that Azure Machine Learning pipelines can use. In this article, you'll learn how to submit Spark jobs using:
+- Azure Machine Learning studio UI
- Azure Machine Learning CLI
- Azure Machine Learning SDK
-See [this resource](./apache-spark-azure-ml-concepts.md) for more information about **Apache Spark in Azure Machine Learning** concepts.
+For more information about **Apache Spark in Azure Machine Learning** concepts, see [this resource](./apache-spark-azure-ml-concepts.md).
## Prerequisites
@@ -42,29 +42,23 @@ See [this resource](./apache-spark-azure-ml-concepts.md) for more information ab
- [(Optional): An attached Synapse Spark pool in the Azure Machine Learning workspace](./how-to-manage-synapse-spark-pool.md).
# [Studio UI](#tab/ui)
-These prerequisites cover the submission of a Spark job from Azure Machine Learning Studio UI:
+These prerequisites cover the submission of a Spark job from Azure Machine Learning studio UI:
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
- To enable this feature:
-1. Navigate to Azure Machine Learning Studio UI.
+1. Navigate to Azure Machine Learning studio UI.
2. Select **Manage preview features** (megaphone icon) from the icons on the top right side of the screen.
3. In the **Managed preview feature** panel, toggle on the **Run notebooks and jobs on managed Spark** feature.
- [(Optional): An attached Synapse Spark pool in the Azure Machine Learning workspace](./how-to-manage-synapse-spark-pool.md).
---
-## Ensuring resource access for Spark jobs
-
-Spark jobs can use either user identity passthrough, or a managed identity, to access data and other resources. The following table summarizes the different mechanisms for resource access while using Azure Machine Learning Managed (Automatic) Spark compute and attached Synapse Spark pool.
-|Managed (Automatic) Spark compute|User identity and managed identity|User identity|
-|Attached Synapse Spark pool|User identity and managed identity|Managed identity - compute identity of the attached Synapse Spark pool|
-
-If the CLI or SDK code defines an option to use managed identity, Azure Machine Learning Managed (Automatic) Spark compute uses user-assigned managed identity attached to the workspace. You can attach a user-assigned managed identity to an existing Azure Machine Learning workspace using Azure Machine Learning CLI v2, or with `ARMClient`.
+> [!NOTE]
+> To learn more about resource access while using Azure Machine Learning Managed (Automatic) Spark compute and attached Synapse Spark pool, see [Ensuring resource access for Spark jobs](apache-spark-environment-configuration.md#ensuring-resource-access-for-spark-jobs).
### Attach user assigned managed identity using CLI v2
1. Create a YAML file that defines the user-assigned managed identity that should be attached to the workspace:
```yaml
identity:
```
@@ -80,6 +74,7 @@ If the CLI or SDK code defines an option to use managed identity, Azure Machine
### Attach user assigned managed identity using `ARMClient`
1. Install [ARMClient](https://github.com/projectkudu/ARMClient), a simple command line tool that invokes the Azure Resource Manager API.
1. Create a JSON file that defines the user-assigned managed identity that should be attached to the workspace:
@@ -146,6 +141,7 @@ The above script takes two arguments `--titanic_data` and `--wrangled_data`, whi
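The hunk header above references a wrangling script driven by two arguments, `--titanic_data` and `--wrangled_data`. The script's body is not shown in this diff; as a hedged sketch (everything beyond the two argument names is an assumption), its argument handling might look like:

```python
# Hypothetical sketch of the entry script's argument handling. The two
# argument names come from the text; the wrangling itself is elided here.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Titanic data wrangling job")
    parser.add_argument("--titanic_data", required=True,
                        help="Path of the input Titanic data")
    parser.add_argument("--wrangled_data", required=True,
                        help="Folder that receives the wrangled output")
    return parser.parse_args(argv)

# Demonstration with illustrative paths; a real job would create a
# SparkSession, read args.titanic_data, transform it, and write the
# result to args.wrangled_data.
args = parse_args(["--titanic_data", "abfss://data/titanic.csv",
                   "--wrangled_data", "abfss://data/wrangled"])
```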
To create a job, a standalone Spark job can be defined as a YAML specification file, which can be used in the `az ml job create` command, with the `--file` parameter. Define these properties in the YAML file as follows:
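As a sketch of what the paragraph above and the property list that follows describe, the specification can be modeled as a plain dictionary and checked for its required properties. The `file`/`class_name` names under `entry`, and the validation helper itself, are illustrative assumptions, not part of `az ml`:

```python
# Illustrative model of a standalone Spark job spec, mirroring the YAML
# properties named in the text (`type`, `code`, `entry`, optional `identity`).
# File and script names are placeholders.

REQUIRED_KEYS = {"type", "code", "entry"}

def validate_spark_job_spec(spec: dict) -> list:
    """Return a list of problems found in a job-spec dict (empty if none)."""
    problems = [f"missing required property: {k}"
                for k in sorted(REQUIRED_KEYS - spec.keys())]
    if spec.get("type") != "spark":
        problems.append("`type` must be set to `spark`")
    entry = spec.get("entry", {})
    if not ("file" in entry or "class_name" in entry):
        problems.append("`entry` should define a script file or a class name")
    return problems

spec = {
    "type": "spark",
    "code": "./src",                        # folder with source code and scripts
    "entry": {"file": "titanic.py"},        # entry point for the job
    "identity": {"type": "user_identity"},  # optional; default identity if omitted
}
```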
### YAML properties in the Spark job specification
- `type`- set to `spark`.
- `code`- defines the location of the folder that contains source code and scripts for this job.
- `entry` - defines the entry point for the job. It should cover one of these properties:
@@ -222,9 +218,10 @@ To create a job, a standalone Spark job can be defined as a YAML specification f
-- `identity`- this optional property defines the identity used to submit this job. It can have `user_identity` and `managed` values. If no identity is defined in the YAML specification, the default identity will be used.
+- `identity`- this optional property defines the identity used to submit this job. It can have `user_identity` and `managed` values. If no identity is defined in the YAML specification, the Spark job will use the default identity.
### Standalone Spark job
This example YAML specification shows a standalone Spark job. It uses an Azure Machine Learning Managed (Automatic) Spark compute:
@@ -304,7 +301,7 @@ To create a standalone Spark job, use the `azure.ai.ml.spark` function, with the
- `dynamic_allocation_max_executors`- the maximum number of Spark executors instances for dynamic allocation.
- If dynamic allocation of executors is disabled, then define these parameters:
- `executor_instances`- the number of Spark executor instances.
-- `environment` - the Azure Machine Learning environment that will run the job. This parameter should pass:
+- `environment` - the Azure Machine Learning environment that runs the job. This parameter should pass:
- an object of `azure.ai.ml.entities.Environment`, or an Azure Machine Learning environment name (string).
- `args`- the command line arguments that should be passed to the job entry point Python script or class. See the sample code provided here for an example.
- `resources` - the resources to be used by an Azure Machine Learning Managed (Automatic) Spark compute. This parameter should pass a dictionary with:
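The executor-allocation rule described in the list above (dynamic allocation needs minimum and maximum executor counts; fixed allocation needs `executor_instances`) can be sketched as a small validation helper. Parameter names follow the text; the helper itself is illustrative and not part of the `azure.ai.ml` SDK:

```python
# Illustrative check of the two executor-sizing modes the text describes.
from typing import Optional

def resolve_executor_config(
    dynamic_allocation_enabled: bool,
    executor_instances: Optional[int] = None,
    dynamic_allocation_min_executors: Optional[int] = None,
    dynamic_allocation_max_executors: Optional[int] = None,
) -> dict:
    if dynamic_allocation_enabled:
        # Dynamic allocation: both bounds must be supplied and consistent.
        if dynamic_allocation_min_executors is None or dynamic_allocation_max_executors is None:
            raise ValueError("dynamic allocation requires min and max executor counts")
        if dynamic_allocation_min_executors > dynamic_allocation_max_executors:
            raise ValueError("min executors cannot exceed max executors")
        return {
            "dynamic_allocation_enabled": True,
            "dynamic_allocation_min_executors": dynamic_allocation_min_executors,
            "dynamic_allocation_max_executors": dynamic_allocation_max_executors,
        }
    # Fixed allocation: a concrete executor count is required instead.
    if executor_instances is None:
        raise ValueError("fixed allocation requires executor_instances")
    return {"dynamic_allocation_enabled": False, "executor_instances": executor_instances}

fixed = resolve_executor_config(False, executor_instances=2)
dynamic = resolve_executor_config(True, dynamic_allocation_min_executors=1,
                                  dynamic_allocation_max_executors=4)
```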
@@ -336,7 +333,7 @@ To create a standalone Spark job, use the `azure.ai.ml.spark` function, with the
-### Submit a standalone Spark job from Azure Machine Learning Studio UI
-To submit a standalone Spark job using the Azure Machine Learning Studio UI:
-:::image type="content" source="media/how-to-submit-spark-jobs/create_standalone_spark_job.png" alt-text="Screenshot showing creation of a new Spark job in Azure Machine Learning Studio UI.":::
+### Submit a standalone Spark job from Azure Machine Learning studio UI
+
+To submit a standalone Spark job using the Azure Machine Learning studio UI:
+
+:::image type="content" source="media/how-to-submit-spark-jobs/create_standalone_spark_job.png" alt-text="Screenshot showing creation of a new Spark job in Azure Machine Learning studio UI.":::
- In the left pane, select **+ New**.
- Select **Spark job (preview)**.
- On the **Compute** screen:
-:::image type="content" source="media/how-to-submit-spark-jobs/create_standalone_spark_job_compute.png" alt-text="Screenshot showing compute selection screen for a new Spark job in Azure Machine Learning Studio UI.":::
+:::image type="content" source="media/how-to-submit-spark-jobs/create_standalone_spark_job_compute.png" alt-text="Screenshot showing compute selection screen for a new Spark job in Azure Machine Learning studio UI.":::
1. Under **Select compute type**, select **Spark automatic compute (Preview)** for Managed (Automatic) Spark compute, or **Attached compute** for an attached Synapse Spark pool.
1. If you selected **Spark automatic compute (Preview)**:
@@ -486,6 +484,7 @@ To submit a standalone Spark job using the Azure Machine Learning Studio UI:
---
## Spark component in a pipeline job
A Spark component offers the flexibility to use the same component in multiple [Azure Machine Learning pipelines](./concept-ml-pipelines.md), as a pipeline step.
# [Azure CLI](#tab/cli)
@@ -606,7 +605,7 @@ You can execute the above command from:
To create an Azure Machine Learning pipeline with a Spark component, you should have familiarity with creation of [Azure Machine Learning pipelines from components, using Python SDK](./tutorial-pipeline-python-sdk.md#create-the-pipeline-from-components). A Spark component is created using `azure.ai.ml.spark` function. The function parameters are defined almost the same way as for the [standalone Spark job](#standalone-spark-job-using-python-sdk). These parameters are defined differently for the Spark component:
- `name`- the name of the Spark component.
-- `display_name`- the name of the Spark component that will display in the UI and elsewhere.
+- `display_name`- the name of the Spark component displayed in the UI and elsewhere.
- `inputs`- this parameter is similar to `inputs` parameter described for the [standalone Spark job](#standalone-spark-job-using-python-sdk), except that the `azure.ai.ml.Input` class is instantiated without the `path` parameter.
- `outputs`- this parameter is similar to `outputs` parameter described for the [standalone Spark job](#standalone-spark-job-using-python-sdk), except that the `azure.ai.ml.Output` class is instantiated without the `path` parameter.
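The `inputs`/`outputs` distinction above (component I/O is declared without a concrete `path`, which binds later when a pipeline uses the component) can be sketched with plain dictionaries standing in for `azure.ai.ml.Input`/`Output` objects; the helper and field names are illustrative assumptions:

```python
# Illustrative sketch: strip concrete paths from job-style I/O bindings to
# get component-style declarations. Dicts stand in for Input/Output objects;
# the storage URI below is a placeholder.

def to_component_io(job_io: dict) -> dict:
    """Drop the `path` field from each binding, keeping type/mode metadata."""
    return {
        name: {k: v for k, v in binding.items() if k != "path"}
        for name, binding in job_io.items()
    }

job_inputs = {
    "titanic_data": {
        "type": "uri_file",
        "path": "abfss://container@account.dfs.core.windows.net/titanic.csv",
        "mode": "direct",
    },
}
component_inputs = to_component_io(job_inputs)
```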
@@ -695,5 +694,6 @@ This functionality isn't available in the Studio UI. The Studio UI doesn't suppo
---
## Next steps
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
-- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)
+- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)