
Commit 095b7fe

Merge pull request #226135 from ShawnJackson/apache-spark-azure-ml-concepts

[AQ] edit pass: apache-spark-azure-ml-concepts

2 parents: f833722 + 1984bd9

1 file changed: 63 additions, 50 deletions
---
title: "Apache Spark in Azure Machine Learning (preview)"
titleSuffix: Azure Machine Learning
description: This article explains the options for accessing Apache Spark in Azure Machine Learning.
services: machine-learning
ms.service: machine-learning
ms.subservice: mldata
author: ynpandey
ms.reviewer: franksolomon
ms.date: 01/30/2023
ms.custom: cliv2, sdkv2
#Customer intent: As a full-stack machine learning pro, I want to use Apache Spark in Azure Machine Learning.
---

# Apache Spark in Azure Machine Learning (preview)

Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to distributed computing through the Apache Spark framework. This integration offers these Apache Spark computing experiences:

- Managed (Automatic) Spark compute
- Attached Synapse Spark pool

## Managed (Automatic) Spark compute
Azure Machine Learning Managed (Automatic) Spark compute is the easiest way to accomplish distributed computing tasks in the Azure Machine Learning environment by using the Apache Spark framework. Azure Machine Learning users can use a fully managed, serverless, on-demand Apache Spark compute cluster. Those users can avoid the need to create an Azure Synapse workspace and a Synapse Spark pool.

Users can define resources, including instance type and Apache Spark runtime version. They can then use those resources to access Managed (Automatic) Spark compute in Azure Machine Learning notebooks for:

- [Interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
- [Spark batch job submissions](./how-to-submit-spark-jobs.md)
- [Running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)
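
To make these resource choices concrete, here's an illustrative sketch of a standalone Spark job specification for Managed (Automatic) Spark compute in CLI (v2) YAML. The entry file, instance type, and runtime version are placeholder assumptions for illustration; verify the exact keys and allowed values against the current Spark job YAML schema:

```yaml
# Illustrative sketch of a standalone Spark job (CLI v2 YAML).
# Entry file, instance type, and runtime version are placeholders.
type: spark
code: ./src
entry:
  file: wrangle.py

# Standard Apache Spark executor/driver settings.
conf:
  spark.executor.instances: 2
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.driver.cores: 1
  spark.driver.memory: 2g

# Managed (Automatic) Spark compute: pick instance type and runtime version.
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"
```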

### Points to consider

Managed (Automatic) Spark compute works well for most user scenarios that require quick access to distributed computing through Apache Spark. But to make an informed decision, users should consider the advantages and disadvantages of this approach.

Advantages:

- There are no dependencies on other Azure resources to be created for Apache Spark.
- No permissions are required in the subscription to create Azure Synapse-related resources.
- There's no need for SQL pool quotas.

Disadvantages:

- A persistent Hive metastore is missing. Managed (Automatic) Spark compute supports only in-memory Spark SQL.
- No tables or databases are available.
- Azure Purview integration is missing.
- Linked services aren't available.
- There are fewer data sources and connectors.
- Pool-level configuration is missing.
- Pool-level library management is missing.
- There's only partial support for `mssparkutils`.

### Network configuration
As of January 2023, creating a Managed (Automatic) Spark compute inside a virtual network and creating a private endpoint to Azure Synapse are not supported.

### Inactivity periods and tear-down mechanism

A Managed (Automatic) Spark compute (*cold start*) resource might need three to five minutes to start the Spark session when it's first launched. The automated Managed (Automatic) Spark compute provisioning, backed by Azure Synapse, causes this delay. After the Managed (Automatic) Spark compute is provisioned and an Apache Spark session starts, subsequent code executions (*warm start*) won't experience this delay.

The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session will end after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following 10 minutes, resources provisioned for the Managed (Automatic) Spark compute will be torn down.

After the Managed (Automatic) Spark compute resource tear-down happens, submission of the next job will require a *cold start*. The next visualization shows some session inactivity period and cluster tear-down scenarios.

:::image type="content" source="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" lightbox="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" alt-text="Expandable diagram that shows scenarios for Apache Spark session inactivity period and cluster teardown.":::
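
The inactivity behavior can also be modeled as a small Python sketch. The function name and the 60-minute default timeout are hypothetical illustrations, not part of any Azure SDK; only the 10-minute tear-down window comes from the behavior described here:

```python
def next_start_kind(inactivity_min: float,
                    gap_after_timeout_min: float,
                    session_timeout_min: float = 60) -> str:
    """Classify how the next code execution starts under the tear-down rules.

    inactivity_min: minutes of inactivity in the current Spark session.
    gap_after_timeout_min: minutes until another session starts after the
        session has timed out.
    session_timeout_min: user-defined session timeout (illustrative default).
    """
    if inactivity_min <= session_timeout_min:
        # The session never timed out: code runs in the live session.
        return "same session"
    if gap_after_timeout_min <= 10:
        # The session ended, but the compute is still provisioned.
        return "warm start"
    # No session started within 10 minutes: the compute was torn down,
    # so the next submission pays the cold-start provisioning delay.
    return "cold start"
```

For example, 90 minutes of inactivity against a 60-minute timeout, followed by a 30-minute gap, would produce a cold start on the next submission.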

## Attached Synapse Spark pool
A Spark pool created in an Azure Synapse workspace becomes available in the Azure Machine Learning workspace with the attached Synapse Spark pool. This option might be suitable for users who want to reuse an existing Synapse Spark pool.

Attachment of a Synapse Spark pool to an Azure Machine Learning workspace requires [other steps](./how-to-manage-synapse-spark-pool.md) before you can use the pool in Azure Machine Learning for:

- [Interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
- [Spark batch job submission](./how-to-submit-spark-jobs.md)
- [Running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)

An attached Synapse Spark pool provides access to native Azure Synapse features. The user is responsible for provisioning, attaching, configuring, and managing the Synapse Spark pool.

The Spark session configuration for an attached Synapse Spark pool also offers an option to define a session timeout (in minutes). The session timeout behavior resembles the description in [the previous section](#inactivity-periods-and-tear-down-mechanism), except that the associated resources are never torn down after the session timeout.

## Defining Spark cluster size
You can define Spark cluster size by using three parameter values in Azure Machine Learning Spark jobs:

- Number of executors
- Executor cores
- Executor memory

You should consider an Azure Machine Learning Apache Spark executor as an equivalent of Azure Spark worker nodes. An example can explain these parameters. Let's say that you defined the number of executors as 6 (equivalent to six worker nodes), executor cores as 4, and executor memory as 28 GB. Your Spark job will then have access to a cluster with 24 cores and 168 GB of memory.

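The arithmetic in the example above follows directly from the three parameters. As a plain-Python sketch (the function name is illustrative, not part of any Azure SDK):

```python
def cluster_totals(executors: int, cores_per_executor: int,
                   memory_gb_per_executor: int) -> tuple[int, int]:
    """Total cores and total memory (GB) available across all executors."""
    total_cores = executors * cores_per_executor
    total_memory_gb = executors * memory_gb_per_executor
    return total_cores, total_memory_gb

# 6 executors x 4 cores = 24 cores; 6 executors x 28 GB = 168 GB.
cores, memory_gb = cluster_totals(executors=6, cores_per_executor=4,
                                  memory_gb_per_executor=28)
```

In standard Apache Spark terms, these three parameters correspond to the `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` configuration properties.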
## Ensuring resource access for Spark jobs
To access data and other resources, a Spark job can use either a user identity passthrough or a managed identity. This table summarizes the mechanisms that Spark jobs use to access resources:

|Spark pool|Supported identities|Default identity|
| ---------- | -------------------- | ---------------- |
|Managed (Automatic) Spark compute|User identity and managed identity|User identity|
|Attached Synapse Spark pool|User identity and managed identity|Managed identity - compute identity of the attached Synapse Spark pool|

[This article](./how-to-submit-spark-jobs.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the Managed (Automatic) Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).

> [!NOTE]
> To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account that's used for data input and output) to the identity that's used for submitting the Spark job.
>
> If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace, and that workspace has an associated managed virtual network, [configure a managed private endpoint to a storage account](../synapse-analytics/security/connect-to-a-secure-storage-account.md). This configuration will help ensure data access.

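The role assignments described in the note can be granted with the Azure CLI, for example. In this sketch, the identity object ID, subscription, resource group, and storage account name are placeholders you'd replace with your own values:

```shell
# Scope: the storage account used for the Spark job's input and output data.
# <subscription-id>, <resource-group>, <storage-account>, and
# <identity-object-id> are placeholders.
SCOPE="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"

az role assignment create \
  --assignee "<identity-object-id>" \
  --role "Contributor" \
  --scope "$SCOPE"

az role assignment create \
  --assignee "<identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "$SCOPE"
```
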
[This quickstart](./quickstart-spark-jobs.md) describes how to start using Managed (Automatic) Spark compute to submit your Spark jobs in Azure Machine Learning.

## Next steps

- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
- [Interactive data wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
- [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
- [Code samples for Spark jobs using the Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
- [Code samples for Spark jobs using the Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)
