Azure Machine Learning provides the ability to submit standalone machine learning jobs, or to create a [machine learning pipeline](./concept-ml-pipelines.md) comprising multiple steps in a machine learning workflow. Azure Machine Learning supports creation of a standalone Spark job, and creation of a reusable Spark component that can be used in Azure Machine Learning pipelines. In this article, you learn how to submit Spark jobs using:
- Azure Machine Learning studio UI
- Azure Machine Learning CLI
- Azure Machine Learning SDK
## Prerequisites
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md)
- [An attached Synapse Spark pool in the Azure Machine Learning workspace](./how-to-manage-synapse-spark-pool.md)
- [Configure your development environment](./how-to-configure-environment.md), or [create an Azure Machine Learning compute instance](./concept-compute-instance.md#create)
- [Install the Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/installv2)
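If you install the SDK yourself, a typical install looks like the following; the `azure-ai-ml` package name assumes the v2 SDK referenced by the link above:

```bash
# Install the Azure Machine Learning SDK v2 for Python (assumed package name).
pip install azure-ai-ml
```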
Spark jobs can use either user identity passthrough or a managed identity to access data and other resources. The following table summarizes the different mechanisms for resource access while using an attached Synapse Spark pool and Managed (Automatic) Spark compute.
> [!NOTE]
> To ensure successful execution of the Spark job, assign the **Contributor** and **Storage Blob Data Contributor** roles, on the Azure storage account used for data input and output, to the identity that the Spark job uses.
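As an illustration, these role assignments could be granted with the Azure CLI along the following lines; the identity object ID and the storage account resource ID are placeholders:

```azurecli
# Placeholders: the object ID of the identity the Spark job uses, and the
# resource ID of the storage account used for data input and output.
scope="/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>"

az role assignment create --role "Contributor" \
  --assignee-object-id "<IDENTITY_OBJECT_ID>" --scope "$scope"

az role assignment create --role "Storage Blob Data Contributor" \
  --assignee-object-id "<IDENTITY_OBJECT_ID>" --scope "$scope"
```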
## Submit a standalone Spark job
After you develop a Python script through [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md), you can use it to submit a batch job that processes a larger volume of data, once you make the changes needed to parameterize the script. A simple data wrangling batch job can be submitted as a standalone Spark job.
A Spark job requires a Python script that takes arguments. You can develop such a script by modifying the Python code developed from [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).
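A minimal sketch of such a parameterized script follows; the argument names, input dataset, and wrangling steps are illustrative assumptions:

```python
# titanic.py - a minimal, illustrative data wrangling script.
# The argument names, columns, and transformations are assumptions.
import argparse

import pyspark.pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data", help="URI of the input CSV data")
parser.add_argument("--wrangled_data", help="URI of the output folder")
args = parser.parse_args()

# Read the input data into a pandas-on-Spark DataFrame.
df = pd.read_csv(args.titanic_data, index_col="PassengerId")

# Fill missing Cabin values with a sentinel, then drop any rows
# that still contain missing values.
df.fillna(value={"Cabin": "None"}, inplace=True)
df.dropna(inplace=True)

# Write the wrangled data to the output location.
df.to_csv(args.wrangled_data, index_col="PassengerId")
```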
A standalone Spark job can be defined as a YAML specification file, which can be used to submit the job. The YAML specification defines properties that include:
- `spark.dynamicAllocation.maxExecutors` - the maximum number of Spark executor instances, for dynamic allocation.
- If dynamic allocation of executors is disabled, define this property:
- `spark.executor.instances` - the number of Spark executor instances.
- `environment` - an [Azure Machine Learning environment](./reference-yaml-environment.md) to run the job.
- `args` - the command-line arguments that should be passed to the job entry point Python script or class. See the YAML specification file provided below for an example.
- `compute` - this property defines the name of an attached Synapse Spark pool, as shown in this example:

```yaml
compute: <ATTACHED_SPARK_POOL_NAME>
```
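For reference, a complete standalone Spark job specification might look like the following sketch; the schema URL is the published `sparkJob` schema, while the file names, data paths, identity, and configuration values are illustrative assumptions:

```yaml
# An illustrative standalone Spark job specification; names, paths,
# and configuration values are assumptions.
$schema: https://azuremlschemas.azureedge.net/latest/sparkJob.schema.json
type: spark

code: ./src
entry:
  file: titanic.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4

inputs:
  titanic_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}

identity:
  type: user_identity

compute: <ATTACHED_SPARK_POOL_NAME>
```

With the Azure CLI `ml` extension configured, a file like this can be submitted using `az ml job create --file <FILE_NAME>.yaml`, along with the usual workspace parameters.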
To submit a standalone Spark job using the Azure Machine Learning studio UI:
1. Select **Create** to submit the standalone Spark job.
## Spark component in a pipeline job
A Spark component offers the flexibility to use the same component, as a pipeline step, in multiple [Azure Machine Learning pipelines](./concept-ml-pipelines.md).
# [Azure CLI](#tab/cli)
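A sketch of a Spark component YAML specification file is shown here; the schema URL is the published `sparkComponent` schema, while the component name, entry file, inputs, outputs, and configuration values are illustrative assumptions:

```yaml
# An illustrative Spark component specification; the name, entry file,
# inputs, outputs, and configuration values are assumptions.
$schema: https://azuremlschemas.azureedge.net/latest/sparkComponent.schema.json
name: titanic_spark_component
type: spark
version: 1

code: ./src
entry:
  file: titanic.py

inputs:
  titanic_data:
    type: uri_file
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4
```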
The Spark component defined in the above YAML specification file can be used in an Azure Machine Learning pipeline job. See the [pipeline job YAML schema](./reference-yaml-job-pipeline.md) to learn more about the YAML syntax that defines a pipeline job. This is an example YAML specification file for a pipeline job with a Spark component.
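This sketch assumes the component specification above is saved as `./spark-component.yml`; the data paths and the attached Synapse Spark pool name are placeholders:

```yaml
# An illustrative pipeline job that uses the Spark component defined above.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: Titanic-Spark-Component-Pipeline
description: Pipeline job with a Spark component that wrangles data

jobs:
  spark_job:
    type: spark
    component: ./spark-component.yml
    inputs:
      titanic_data:
        type: uri_file
        path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
        mode: direct
    outputs:
      wrangled_data:
        type: uri_folder
        path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
        mode: direct
    compute: <ATTACHED_SPARK_POOL_NAME>
```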