Commit fa7736c

Merge pull request #225676 from fbsolo-ms1/updates-for-YP
Yogi P requested file updates . . .
2 parents 085205e + d6203ad commit fa7736c

3 files changed

+106
-0
lines changed

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
1+
---
2+
title: "Apache Spark in Azure Machine Learning (preview)"
3+
titleSuffix: Azure Machine Learning
4+
description: This article explains different options for accessing Apache Spark in Azure Machine Learning.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: mldata
8+
ms.topic: conceptual
9+
ms.author: franksolomon
10+
author: ynpandey
11+
ms.reviewer: franksolomon
12+
ms.date: 01/30/2023
13+
ms.custom: cliv2, sdkv2
14+
#Customer intent: As a Full Stack ML Pro, I want to use Apache Spark in Azure Machine Learning.
15+
---
16+
17+
# Apache Spark in Azure Machine Learning (preview)
18+
The Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to distributed computing, using the Apache Spark framework. This integration offers these Apache Spark computing experiences:
19+
- Managed (Automatic) Spark compute
20+
- Attached Synapse Spark pool
21+
22+
## Managed (Automatic) Spark compute
23+
Azure Machine Learning Managed (Automatic) Spark compute is the easiest way to execute distributed computing tasks in the Azure Machine Learning environment, using the Apache Spark framework. Azure Machine Learning users can use a fully managed, serverless, on-demand Apache Spark compute cluster. Those users can avoid the need to create an Azure Synapse Workspace and an Azure Synapse Spark pool. Users can define the resources, including
24+
25+
- instance type
26+
- Apache Spark runtime version
27+
28+
to access the Managed (Automatic) Spark compute in Azure Machine Learning Notebooks, for
29+
30+
- [interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
31+
- [Spark batch job submissions](./how-to-submit-spark-jobs.md)
32+
- [running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)
33+
34+
### Some points to consider
35+
Managed (Automatic) Spark compute works well for most user scenarios that require quick access to distributed computing using Apache Spark. To make an informed decision, however, users should consider the advantages and disadvantages of this approach.
36+
37+
### Advantages
38+
39+
- No dependencies on other Azure resources to be created for Apache Spark
40+
- No permissions required in the subscription to create Synapse-related resources
41+
- No need for SQL pool quota
42+
43+
### Disadvantages
44+
45+
- Persistent Hive metastore is missing. Therefore, Managed (Automatic) Spark compute only supports in-memory Spark SQL
46+
- No available tables or databases
47+
- Missing Purview integration
48+
- Linked Services not available
49+
- Fewer Data sources/connectors
50+
- Missing pool-level configuration
51+
- Missing pool-level library management
52+
- Partial support for `mssparkutils`
53+
54+
### Network configuration
55+
As of January 2023, the Managed (Automatic) Spark compute doesn't support managed VNet or private endpoint creation to Azure Synapse.
56+
57+
### Inactivity periods and tear down mechanism
58+
When first launched, a Managed (Automatic) Spark compute resource might need three to five minutes to start the Spark session (**cold start**). The automated provisioning of the Managed (Automatic) Spark compute, backed by Azure Synapse, causes this delay. Once the Managed (Automatic) Spark compute is provisioned and an Apache Spark session starts, subsequent code executions (**warm start**) won't experience this delay. The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session terminates after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following 10 minutes, the resources provisioned for the Managed (Automatic) Spark compute are torn down. After the teardown, submission of the next job requires a *cold start*. The next visualization shows some session inactivity period and cluster teardown scenarios.
59+
60+
:::image type="content" source="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" lightbox="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" alt-text="Expandable screenshot that shows different scenarios for Apache Spark session inactivity period and cluster teardown.":::
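
The timeout and teardown rules described above can be sketched as a small state function. This is only an illustrative model, not an Azure API: the names `ComputeState`, `compute_state`, and `needs_cold_start` are hypothetical, and the treatment of the exact boundary values (idle time equal to the timeout, or equal to the timeout plus 10 minutes) is an assumption.

```python
from enum import Enum

# The 10-minute window comes from the paragraph above; the constant name is hypothetical.
TEARDOWN_WINDOW_MINUTES = 10


class ComputeState(Enum):
    SESSION_ACTIVE = "session active"
    SESSION_ENDED = "session ended, compute still provisioned"
    TORN_DOWN = "compute torn down"


def compute_state(idle_minutes: float, session_timeout_minutes: float) -> ComputeState:
    """Model the state of a Managed (Automatic) Spark compute after an idle period."""
    if idle_minutes <= session_timeout_minutes:
        # Inactivity has not exceeded the user-defined timeout.
        return ComputeState.SESSION_ACTIVE
    if idle_minutes <= session_timeout_minutes + TEARDOWN_WINDOW_MINUTES:
        # The session has terminated, but the provisioned resources are still there.
        return ComputeState.SESSION_ENDED
    # No new session started within the teardown window.
    return ComputeState.TORN_DOWN


def needs_cold_start(idle_minutes: float, session_timeout_minutes: float) -> bool:
    """Return True when the next job would have to provision the compute again."""
    return compute_state(idle_minutes, session_timeout_minutes) is ComputeState.TORN_DOWN


print(compute_state(5, session_timeout_minutes=15).value)
print(compute_state(20, session_timeout_minutes=15).value)
print(needs_cold_start(30, session_timeout_minutes=15))
```

With a 15-minute session timeout, 5 idle minutes leave the session active, 20 idle minutes end the session but keep the compute provisioned, and 30 idle minutes lead to a teardown and a cold start for the next job.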
61+
62+
## Attached Synapse Spark pool
63+
A Synapse Spark pool created in an Azure Synapse workspace becomes available in the Azure Machine Learning workspace as an attached Synapse Spark pool. This option might suit users who want to reuse an existing Azure Synapse Spark pool. Attaching an Azure Synapse Spark pool to the Azure Machine Learning workspace requires [other steps](./how-to-manage-synapse-spark-pool.md) before the pool can be used in Azure Machine Learning for
64+
65+
- [interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
66+
- [Spark batch job submission](./how-to-submit-spark-jobs.md), or
67+
- [running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)
68+
69+
While an attached Synapse Spark pool provides access to native Synapse features, the user is responsible for provisioning, attaching, configuring, and managing the Synapse Spark pool.
70+
71+
The Spark session configuration for an attached Synapse Spark pool also offers an option to define a session timeout (in minutes). The session timeout behavior matches the description in [the previous section](#inactivity-periods-and-tear-down-mechanism), except that the associated resources are never torn down after the session timeout.
72+
73+
## Defining Spark cluster size
74+
You can define three parameter values
75+
76+
- number of executors
77+
- executor cores
78+
- executor memory
79+
80+
in Azure Machine Learning Spark jobs. Consider an Azure Machine Learning Apache Spark executor as the equivalent of an Azure Spark worker node. For example, suppose you define the number of executors as 6 (equivalent to six worker nodes), executor cores as 4, and executor memory as 28 GB. Your Spark job then has access to a cluster with 6 × 4 = 24 cores and 6 × 28 GB = 168 GB of memory.
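
The cluster-size arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `SparkClusterSize` is a hypothetical helper written for this illustration, not a class from the Azure Machine Learning SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkClusterSize:
    """Hypothetical helper that mirrors the three parameters listed above."""

    executor_instances: int   # number of executors
    executor_cores: int       # cores per executor
    executor_memory_gb: int   # memory per executor, in GB

    @property
    def total_cores(self) -> int:
        # Each executor contributes the same number of cores.
        return self.executor_instances * self.executor_cores

    @property
    def total_memory_gb(self) -> int:
        # Each executor contributes the same amount of memory.
        return self.executor_instances * self.executor_memory_gb


size = SparkClusterSize(executor_instances=6, executor_cores=4, executor_memory_gb=28)
print(size.total_cores, size.total_memory_gb)  # 24 168
```

The totals count executor resources only, as in the example in the text.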
81+
82+
## Ensuring resource access for Spark jobs
83+
To access data and other resources, a Spark job can use either user identity passthrough or a managed identity. This table summarizes the mechanisms that Spark jobs use to access resources.
84+
85+
|Spark pool|Supported identities|Default identity|
86+
| ---------- | -------------------- | ---------------- |
87+
|Managed (Automatic) Spark compute|User identity and managed identity|User identity|
88+
|Attached Synapse Spark pool|User identity and managed identity|Managed identity - compute identity of the attached Synapse Spark pool|
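
The table can also be read as a lookup: when a job doesn't request an identity, the default for the pool type applies. The sketch below encodes only that reading. The dictionary keys, the string labels, and the `resolve_identity` function are hypothetical names chosen for illustration and don't correspond to an Azure Machine Learning API.

```python
from typing import Optional

# Hypothetical encoding of the table above; keys and labels are illustrative only.
SPARK_IDENTITIES = {
    "managed_spark_compute": {
        "supported": {"user_identity", "managed_identity"},
        "default": "user_identity",
    },
    "attached_synapse_spark_pool": {
        "supported": {"user_identity", "managed_identity"},
        # Compute identity of the attached Synapse Spark pool.
        "default": "managed_identity",
    },
}


def resolve_identity(pool_type: str, requested: Optional[str] = None) -> str:
    """Return the identity a Spark job would use for the given pool type."""
    entry = SPARK_IDENTITIES[pool_type]
    if requested is None:
        # No explicit choice: fall back to the default from the table.
        return entry["default"]
    if requested not in entry["supported"]:
        raise ValueError(f"{requested!r} is not supported for {pool_type!r}")
    return requested


print(resolve_identity("managed_spark_compute"))
print(resolve_identity("attached_synapse_spark_pool"))
print(resolve_identity("attached_synapse_spark_pool", "user_identity"))
```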
89+
90+
[This page](./how-to-submit-spark-jobs.md#ensuring-resource-access-for-spark-jobs) describes Spark job resource access. In a Notebooks session, both the Managed (Automatic) Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).
91+
92+
> [!NOTE]
93+
> - To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles, on the Azure storage account used for data input and output, to the identity used for the Spark job.
94+
> - If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace, and that workspace has an associated managed virtual network, [configure a managed private endpoint to the storage account](../synapse-analytics/security/connect-to-a-secure-storage-account.md) to ensure data access.
95+
96+
This [quickstart guide](./quickstart-spark-jobs.md) describes how to start using Managed (Automatic) Spark compute to submit your Spark jobs in Azure Machine Learning.
97+
98+
## Next steps
99+
- [Quickstart: Apache Spark jobs in Azure Machine Learning (preview)](./quickstart-spark-jobs.md)
100+
- [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
101+
- [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
102+
- [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
103+
- [Code samples for Spark jobs using Azure Machine Learning CLI](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark)
104+
- [Code samples for Spark jobs using Azure Machine Learning Python SDK](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/spark)

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,8 @@
6868
href: quickstart-spark-jobs.md
6969
- name: Run Jupyter notebooks
7070
href: quickstart-run-notebooks.md
71+
- name: Apache Spark in Azure Machine Learning (preview)
72+
href: apache-spark-azure-ml-concepts.md
7173
- name: Tutorials
7274
expanded: true
7375
items:
