
Commit 8cd1763

Merge pull request #171 from fbsolo-ms1/document-freshness-maintenance
Freshness update for apache-spark-azure-ml-concepts.md . . .
2 parents 2166279 + 5f29176 commit 8cd1763


articles/machine-learning/apache-spark-azure-ml-concepts.md

Lines changed: 15 additions & 17 deletions
@@ -1,15 +1,15 @@
 ---
 title: "Apache Spark in Azure Machine Learning"
 titleSuffix: Azure Machine Learning
-description: This article explains the options for accessing Apache Spark in Azure Machine Learning.
+description: This article explains the available options to access Apache Spark in Azure Machine Learning.
 services: machine-learning
 ms.service: azure-machine-learning
 ms.subservice: mldata
 ms.topic: conceptual
 author: fbsolo-ms1
 ms.author: franksolomon
 ms.reviewer: yogipandey
-ms.date: 10/05/2023
+ms.date: 09/06/2024
 ms.custom: cliv2, sdkv2, build-2023
 #Customer intent: As a full-stack machine learning pro, I want to use Apache Spark in Azure Machine Learning.
 ---
@@ -23,13 +23,13 @@ Azure Machine Learning integration with Azure Synapse Analytics provides easy ac
 
 ## Serverless Spark compute
 
-With the Apache Spark framework, Azure Machine Learning serverless Spark compute is the easiest way to accomplish distributed computing tasks in the Azure Machine Learning environment. Azure Machine Learning offers a fully managed, serverless, on-demand Apache Spark compute cluster. Its users can avoid the need to create an Azure Synapse workspace and a Synapse Spark pool.
+With the Apache Spark framework, Azure Machine Learning serverless Spark compute is the easiest way to accomplish distributed computing tasks in the Azure Machine Learning environment. Azure Machine Learning offers a fully managed, serverless, on-demand Apache Spark compute cluster. Its users can avoid the need to create both an Azure Synapse workspace and a Synapse Spark pool.
 
 Users can define resources, including instance type and the Apache Spark runtime version. They can then use those resources to access serverless Spark compute, in Azure Machine Learning notebooks, for the following tasks (see the sketch after this list):
 
 - [Interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
-- [Spark batch job submissions](./how-to-submit-spark-jobs.md)
 - [Running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)
+- [Spark batch job submissions](./how-to-submit-spark-jobs.md)

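A minimal sketch, assuming the Azure Machine Learning Python SDK v2 (`azure-ai-ml`): submitting a batch job to serverless Spark compute. The workspace values, folder, entry script, and instance sizes below are illustrative placeholders, not this article's own sample.

```python
# A sketch, assuming the azure-ai-ml (SDK v2) spark() job builder.
# All workspace values and the entry script are illustrative placeholders.
from azure.ai.ml import MLClient, spark
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

spark_job = spark(
    display_name="serverless-spark-sketch",
    code="./src",                    # folder that holds the entry script
    entry={"file": "wrangle.py"},    # hypothetical entry script
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    # The two resource settings named above: instance type and runtime version.
    resources={"instance_type": "Standard_E8S_V3", "runtime_version": "3.3"},
)

ml_client.jobs.create_or_update(spark_job)  # submits the batch job
```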
 ### Points to consider
 
@@ -58,53 +58,51 @@ To use network isolation with Azure Machine Learning and serverless Spark comput
 
 ### Inactivity periods and tear-down mechanism
 
-At first launch, a serverless Spark compute (*cold start*) resource might need three to five minutes to start the Spark session itself. The automated serverless Spark compute provisioning, backed by Azure Synapse, causes this delay. After the serverless Spark compute is provisioned, and an Apache Spark session starts, subsequent code executions (*warm start*) won't experience this delay.
+At first launch, a serverless Spark compute (*cold start*) resource might need three to five minutes to start the Spark session itself. Provisioning of the automated serverless Spark compute resource, backed by Azure Synapse, causes this delay. After the serverless Spark compute is provisioned, and an Apache Spark session starts, subsequent code executions (*warm start*) won't experience this delay.
 
-The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session will end after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following 10 minutes, resources provisioned for the serverless Spark compute will be torn down.
+The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session will end after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following 10 minutes, resources provisioned for the serverless Spark compute are torn down.
 
 After the serverless Spark compute resource tear-down happens, submission of the next job will require a *cold start*. The next visualization shows some session inactivity period and cluster teardown scenarios.
 
 :::image type="content" source="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" lightbox="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" alt-text="Expandable diagram that shows scenarios for Apache Spark session inactivity period and cluster teardown.":::
 
 ### Session-level Conda packages
-A Conda dependency YAML file can define many session-level Conda packages in a session configuration. A session will time out if it needs more than 15 minutes to install the Conda packages defined in the YAML file. It becomes important to first check whether a required package is already available in the Azure Synapse base image. To do this, users should follow the link to determine *packages available in the base image for* the Apache Spark version in use:
+A Conda dependency YAML file can define many session-level Conda packages in a session configuration. A session times out if it needs more than 15 minutes to install the Conda packages defined in the YAML file. It becomes important to first check whether a required package is already available in the Azure Synapse base image. To do this, users should visit these resources to determine *packages available in the base image for* the Apache Spark version in use (a hypothetical Conda file sketch follows these links):
 - [Azure Synapse Runtime for Apache Spark 3.3](https://github.com/microsoft/synapse-spark-runtime/tree/main/Synapse/spark3.3)
-
-
 - [Azure Synapse Runtime for Apache Spark 3.2](https://github.com/microsoft/synapse-spark-runtime/tree/main/Synapse/spark3.2)

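When a required package isn't in the base image, a session-level Conda specification can add it. The snippet below is a hypothetical sketch (file name and packages are illustrative); in line with the notes that follow, it doesn't try to change the Python, PySpark, or Spark version.

```python
# A hypothetical session-level Conda specification, written out from Python.
# Upload the resulting conda.yml through the "Configure session" UI in the notebook.
# Per the notes below, don't pin the Python, PySpark, or Spark version here.
conda_spec = """\
name: aml-spark-session
channels:
  - conda-forge
dependencies:
  - lightgbm                # illustrative; check the Synapse base image first
  - pip:
      - great-expectations  # illustrative pip dependency
"""
with open("conda.yml", "w", encoding="utf-8") as f:
    f.write(conda_spec)
```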
 > [!IMPORTANT]
 > Azure Synapse Runtime for Apache Spark: Announcements
 > * Azure Synapse Runtime for Apache Spark 3.2:
 > * EOLA Announcement Date: July 8, 2023
 > * End of Support Date: July 8, 2024. After this date, the runtime will be disabled.
-> * For continued support and optimal performance, we advise that you migrate to
+> * For continued support and optimal performance, we advise that you migrate to Apache Spark 3.4.
 
 > [!NOTE]
 > For a session-level Conda package:
 > - the *Cold start* will need about ten to fifteen minutes.
 > - the *Warm start*, using the same Conda package, will need about one minute.
 > - the *Warm start*, with a different Conda package, will also need about ten to fifteen minutes.
-> - If the package that you install is large or needs a long installation time, it might impact the Spark instance startup time.
-> - Altering the PySpark, Python, Scala/Java, .NET, or Spark version is not supported.
+> - If you install a large package, or a package that needs a long installation time, it might impact the Spark instance startup time.
+> - Alteration of the PySpark, Python, Scala/Java, .NET, or Spark version is not supported.
 > - Docker images are not supported.
 
 ### Improving session cold start time while using session-level Conda packages
-You can improve the Spark session *cold start* time by setting the `spark.hadoop.aml.enable_cache` configuration variable to `true`. The session *cold start* with session level Conda packages typically takes 10 to 15 minutes when the session starts for the first time. However, subsequent session *cold starts* take three to five minutes. Define the configuration variable in the **Configure session** user interface, under **Configuration settings**.
+You can set the `spark.hadoop.aml.enable_cache` configuration variable to `true`, to improve the Spark session *cold start* time. With session-level Conda packages, the session *cold start* typically takes 10 to 15 minutes when the session starts for the first time. However, subsequent session *cold starts* take three to five minutes. Define the configuration variable in the **Configure session** user interface, under **Configuration settings**.

 :::image type="content" source="./media/apache-spark-azure-ml-concepts/spark-session-enable-cache.png" lightbox="./media/apache-spark-azure-ml-concepts/spark-session-enable-cache.png" alt-text="Expandable diagram that shows the Spark session configuration tag that enables cache.":::

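The **Configure session** UI is the route this article describes. Purely as an illustration — assuming, without this article stating it, that the same property is also honored as ordinary Spark configuration on a standalone SDK v2 job — the flag could be passed through the `conf` parameter:

```python
# Illustrative only: passing spark.hadoop.aml.enable_cache as Spark configuration
# on an SDK v2 spark job; the article itself sets it in the "Configure session" UI.
from azure.ai.ml import spark

cached_job = spark(
    code="./src",
    entry={"file": "wrangle.py"},    # hypothetical entry script
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "Standard_E8S_V3", "runtime_version": "3.3"},
    conf={"spark.hadoop.aml.enable_cache": "true"},  # the cache flag from this section
)
```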
 ## Attached Synapse Spark pool
 
 A Spark pool created in an Azure Synapse workspace becomes available in the Azure Machine Learning workspace with the attached Synapse Spark pool. This option might be suitable for users who want to reuse an existing Synapse Spark pool.
 
-Attachment of a Synapse Spark pool to an Azure Machine Learning workspace requires [other steps](./how-to-manage-synapse-spark-pool.md) before you can use the pool in Azure Machine Learning for:
+Attachment of a Synapse Spark pool to an Azure Machine Learning workspace requires [more steps](./how-to-manage-synapse-spark-pool.md) before you can use the pool in Azure Machine Learning for the following (see the sketch after this list):
 
 - [Interactive Spark code development](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
 - [Spark batch job submission](./how-to-submit-spark-jobs.md)
 - [Running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)

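A sketch of the attachment step, assuming the SDK v2 `SynapseSparkCompute` entity; every name and the resource ID are placeholders, and the linked article remains the authoritative procedure.

```python
# A sketch, assuming azure-ai-ml's SynapseSparkCompute entity: attach an existing
# Synapse Spark pool to the Azure Machine Learning workspace. IDs are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import SynapseSparkCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

attached_pool = SynapseSparkCompute(
    name="my-attached-pool",  # name the pool gets inside Azure Machine Learning
    resource_id=(
        "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>"
        "/providers/Microsoft.Synapse/workspaces/<SYNAPSE_WORKSPACE>"
        "/bigDataPools/<SPARK_POOL_NAME>"
    ),
)

ml_client.compute.begin_create_or_update(attached_pool).result()
```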
-An attached Synapse Spark pool provides access to native Azure Synapse features. The user is responsible for the Synapse Spark pool provisioning, attaching, configuration, and management.
+An attached Synapse Spark pool provides access to native Azure Synapse features. The user is responsible for the provisioning, attaching, configuration, and management of the Synapse Spark pool.
 
 The Spark session configuration for an attached Synapse Spark pool also offers an option to define a session timeout (in minutes). The session timeout behavior resembles the description in [the previous section](#inactivity-periods-and-tear-down-mechanism), except that the associated resources are never torn down after the session timeout.
 
@@ -127,10 +125,10 @@ To access data and other resources, a Spark job can use either a managed identit
 |Serverless Spark compute|User identity, user-assigned managed identity attached to the workspace|User identity|
 |Attached Synapse Spark pool|User identity, user-assigned managed identity attached to the attached Synapse Spark pool, system-assigned managed identity of the attached Synapse Spark pool|System-assigned managed identity of the attached Synapse Spark pool|
 
-[This article](./apache-spark-environment-configuration.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the serverless Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).
+[This article](./apache-spark-environment-configuration.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the serverless Spark compute and the attached Synapse Spark pool rely on user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).

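A sketch of selecting between the identities in the preceding table, assuming the SDK v2 identity configuration classes; the job fields reuse the earlier placeholders.

```python
# A sketch, assuming azure-ai-ml's identity configuration entities, of selecting
# the identity a Spark job runs under (see the table above for valid options).
from azure.ai.ml import spark
from azure.ai.ml.entities import (
    ManagedIdentityConfiguration,
    UserIdentityConfiguration,
)

job_as_user = spark(
    code="./src",
    entry={"file": "wrangle.py"},    # hypothetical entry script
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "Standard_E8S_V3", "runtime_version": "3.3"},
    identity=UserIdentityConfiguration(),  # run as the submitting user
)

# For a managed identity instead:
# job_as_msi = spark(..., identity=ManagedIdentityConfiguration())
```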
 > [!NOTE]
-> - To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account used for data input and output) to the identity that will be used for the Spark job submission.
+> - To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account used for data input and output) to the identity that you will use for the Spark job submission.
 > - If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace, and that workspace has an associated managed virtual network, [configure a managed private endpoint to a storage account](/azure/synapse-analytics/security/connect-to-a-secure-storage-account). This configuration will help ensure data access.
 
 ## Next steps
