Commit 7d5ed10

Merge pull request #227030 from fbsolo-ms1/updates-for-YP
Yogi P requested edits / new .PNG's . . .
2 parents: c5b9cfa + 238c622

File tree: 8 files changed (+67, -42 lines changed)

articles/machine-learning/apache-spark-azure-ml-concepts.md

Lines changed: 31 additions & 24 deletions
@@ -9,21 +9,21 @@ ms.topic: conceptual
 ms.author: franksolomon
 author: ynpandey
 ms.reviewer: franksolomon
-ms.date: 01/30/2023
+ms.date: 02/10/2023
 ms.custom: cliv2, sdkv2
 #Customer intent: As a full-stack machine learning pro, I want to use Apache Spark in Azure Machine Learning.
 ---
 
 # Apache Spark in Azure Machine Learning (preview)
 
-Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to distributed computing through the Apache Spark framework. This integration offers these Apache Spark computing experiences:
+Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to distributed computation resources through the Apache Spark framework. This integration offers these Apache Spark computing experiences:
 
 - Managed (Automatic) Spark compute
 - Attached Synapse Spark pool
 
 ## Managed (Automatic) Spark compute
 
-Azure Machine Learning Managed (Automatic) Spark compute is the easiest way to accomplish distributed computing tasks in the Azure Machine Learning environment by using the Apache Spark framework. Azure Machine Learning users can use a fully managed, serverless, on-demand Apache Spark compute cluster. Those users can avoid the need to create an Azure Synapse workspace and a Synapse Spark pool.
+With the Apache Spark framework, Azure Machine Learning Managed (Automatic) Spark compute is the easiest way to accomplish distributed computing tasks in the Azure Machine Learning environment. Azure Machine Learning offers a fully managed, serverless, on-demand Apache Spark compute cluster. Its users can avoid the need to create an Azure Synapse workspace and a Synapse Spark pool.
 
 Users can define resources, including instance type and Apache Spark runtime version. They can then use those resources to access Managed (Automatic) Spark compute in Azure Machine Learning notebooks for:
 
@@ -33,39 +33,45 @@ Users can define resources, including instance type and Apache Spark runtime ver
 
 ### Points to consider
 
-Managed (Automatic) Spark compute works well for most user scenarios that require quick access to distributed computing through Apache Spark. But to make an informed decision, users should consider the advantages and disadvantages of this approach.
+Managed (Automatic) Spark compute works well for most user scenarios that require quick access to distributed computing through Apache Spark. However, to make an informed decision, users should consider the advantages and disadvantages of this approach.
 
 Advantages:
 
-- There are no dependencies on other Azure resources to be created for Apache Spark.
-- No permissions are required in the subscription to create Azure Synapse-related resources.
-- There's no need for SQL pool quotas.
+- No dependencies on other Azure resources to be created for Apache Spark (Azure Synapse infrastructure operates under the hood).
+- No required subscription permissions to create Azure Synapse-related resources.
+- No need for SQL pool quotas.
 
 Disadvantages:
 
 - A persistent Hive metastore is missing. Managed (Automatic) Spark compute supports only in-memory Spark SQL.
-- No tables or databases are available.
-- Azure Purview integration is missing.
-- Linked services aren't available.
-- There are fewer data sources and connectors.
-- Pool-level configuration is missing.
-- Pool-level library management is missing.
-- There's only partial support for `mssparkutils`.
+- No available tables or databases.
+- Missing Azure Purview integration.
+- No available linked services.
+- Fewer data sources and connectors.
+- No pool-level configuration.
+- No pool-level library management.
+- Only partial support for `mssparkutils`.
 
 ### Network configuration
 
-As of January 2023, creating a Managed (Automatic) Spark compute inside a virtual network and creating a private endpoint to Azure Synapse are not supported.
+As of January 2023, creation of a Managed (Automatic) Spark compute, inside a virtual network, and creation of a private endpoint to Azure Synapse, aren't supported.
 
 ### Inactivity periods and tear-down mechanism
 
-A Managed (Automatic) Spark compute (*cold start*) resource might need three to five minutes to start the Spark session when it's first launched. The automated Managed (Automatic) Spark compute provisioning, backed by Azure Synapse, causes this delay. After the Managed (Automatic) Spark compute is provisioned and an Apache Spark session starts, subsequent code executions (*warm start*) won't experience this delay.
+At first launch, Managed (Automatic) Spark compute (*cold start*) resource might need three to five minutes to start the Spark session itself. The automated Managed (Automatic) Spark compute provisioning, backed by Azure Synapse, causes this delay. After the Managed (Automatic) Spark compute is provisioned, and an Apache Spark session starts, subsequent code executions (*warm start*) won't experience this delay.
 
-The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session will end after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following 10 minutes, resources provisioned for the Managed (Automatic) Spark compute will be torn down.
+The Spark session configuration offers an option that defines a session timeout (in minutes). The Spark session will end after an inactivity period that exceeds the user-defined timeout. If another Spark session doesn't start in the following ten minutes, resources provisioned for the Managed (Automatic) Spark compute will be torn down.
 
 After the Managed (Automatic) Spark compute resource tear-down happens, submission of the next job will require a *cold start*. The next visualization shows some session inactivity period and cluster teardown scenarios.
 
 :::image type="content" source="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" lightbox="./media/apache-spark-azure-ml-concepts/spark-session-timeout-teardown.png" alt-text="Expandable diagram that shows scenarios for Apache Spark session inactivity period and cluster teardown.":::
 
+> [!NOTE]
+> For a session-level conda package:
+> - *Cold start* time will need about ten to fifteen minutes.
+> - *Warm start* time using same conda package will need about one minute.
+> - *Warm start* with a different conda package will also need about ten to fifteen minutes.
+
 ## Attached Synapse Spark pool
 
 A Spark pool created in an Azure Synapse workspace becomes available in the Azure Machine Learning workspace with the attached Synapse Spark pool. This option might be suitable for users who want to reuse an existing Synapse Spark pool.
@@ -76,19 +82,19 @@ Attachment of a Synapse Spark pool to an Azure Machine Learning workspace requir
 - [Spark batch job submission](./how-to-submit-spark-jobs.md)
 - [Running machine learning pipelines with a Spark component](./how-to-submit-spark-jobs.md#spark-component-in-a-pipeline-job)
 
-An attached Synapse Spark pool provides access to native Azure Synapse features. The user is responsible for provisioning, attaching, configuring, and managing the Synapse Spark pool.
+An attached Synapse Spark pool provides access to native Azure Synapse features. The user is responsible for the Synapse Spark pool provisioning, attaching, configuration, and management.
 
 The Spark session configuration for an attached Synapse Spark pool also offers an option to define a session timeout (in minutes). The session timeout behavior resembles the description in [the previous section](#inactivity-periods-and-tear-down-mechanism), except that the associated resources are never torn down after the session timeout.
 
 ## Defining Spark cluster size
 
-You can define Spark cluster size by using three parameter values in Azure Machine Learning Spark jobs:
+You can define Spark cluster size with three parameter values in Azure Machine Learning Spark jobs:
 
 - Number of executors
 - Executor cores
 - Executor memory
 
-You should consider an Azure Machine Learning Apache Spark executor as an equivalent of Azure Spark worker nodes. An example can explain these parameters. Let's say that you defined the number of executors as 6 (equivalent to six worker nodes), executor cores as 4, and executor memory as 28 GB. Your Spark job will then have access to a cluster with 24 cores and 168 GB of memory.
+You should consider an Azure Machine Learning Apache Spark executor as an equivalent of Azure Spark worker nodes. An example can explain these parameters. Let's say that you defined the number of executors as 6 (equivalent to six worker nodes), executor cores as 4, and executor memory as 28 GB. Your Spark job then has access to a cluster with 24 cores and 168 GB of memory.
 
 ## Ensuring resource access for Spark jobs
 
@@ -102,14 +108,15 @@ To access data and other resources, a Spark job can use either a user identity p
 [This article](./how-to-submit-spark-jobs.md#ensuring-resource-access-for-spark-jobs) describes resource access for Spark jobs. In a notebook session, both the Managed (Automatic) Spark compute and the attached Synapse Spark pool use user identity passthrough for data access during [interactive data wrangling](./interactive-data-wrangling-with-apache-spark-azure-ml.md).
 
 > [!NOTE]
-> To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account that's used for data input and output) to the identity that's used for submitting the Spark job.
->
-> If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace, and that workspace has an associated managed virtual network, [configure a managed private endpoint to a storage account](../synapse-analytics/security/connect-to-a-secure-storage-account.md). This configuration will help ensure data access.
+> - To ensure successful Spark job execution, assign **Contributor** and **Storage Blob Data Contributor** roles (on the Azure storage account used for data input and output) to the identity that's used for submitting the Spark job.
+> - If an [attached Synapse Spark pool](./how-to-manage-synapse-spark-pool.md) points to a Synapse Spark pool in an Azure Synapse workspace, and that workspace has an associated managed virtual network, [configure a managed private endpoint to a storage account](../synapse-analytics/security/connect-to-a-secure-storage-account.md). This configuration will help ensure data access.
+> - Both Managed (Automatic) Spark compute and attached Synapse Spark pool do not work in a notebook created in a private link enabled workspace.
 
-[This quickstart](./quickstart-spark-jobs.md) describes how to start using Managed (Automatic) Spark compute to submit your Spark jobs in Azure Machine Learning.
+[This quickstart](./quickstart-spark-data-wrangling.md) describes how to start using Managed (Automatic) Spark compute in Azure Machine Learning.
 
 ## Next steps
 
+- [Quickstart: Submit Apache Spark jobs in Azure Machine Learning (preview)](./quickstart-spark-jobs.md)
 - [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
 - [Interactive data wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
 - [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
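Editorial aside, not part of this commit: a minimal sketch of the two identity options a standalone Spark job might use, assuming the `azure-ai-ml` (SDK v2) identity configuration classes; all names, paths, and sizes are hypothetical placeholders.

```python
# Hedged sketch, assuming azure-ai-ml (SDK v2) identity configuration classes.
from azure.ai.ml import spark
from azure.ai.ml.entities import ManagedIdentityConfiguration, UserIdentityConfiguration

def make_job(identity):
    # All paths, sizes, and versions below are hypothetical placeholders.
    return spark(
        code="./src",
        entry={"file": "wrangle.py"},
        identity=identity,
        executor_instances=2,
        executor_cores=2,
        executor_memory="7g",
        driver_cores=2,
        driver_memory="7g",
        resources={"instance_type": "Standard_E4S_V3", "runtime_version": "3.2.0"},
    )

# User identity passthrough: the job accesses storage as the submitting user.
job_as_user = make_job(UserIdentityConfiguration())

# Managed identity: the job accesses storage as a managed identity instead.
job_as_managed = make_job(ManagedIdentityConfiguration())
```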
[Binary .png files changed; previews not shown: -825 Bytes, 50.8 KB, 43 KB, 50.8 KB, 43 KB]

articles/machine-learning/quickstart-spark-data-wrangling.md

Lines changed: 13 additions & 5 deletions
@@ -8,15 +8,14 @@ ms.reviewer: franksolomon
 ms.service: machine-learning
 ms.subservice: mldata
 ms.topic: quickstart
-ms.date: 02/06/2023
+ms.date: 02/10/2023
 #Customer intent: As a Full Stack ML Pro, I want to perform interactive data wrangling in Azure Machine Learning, with Apache Spark.
 ---
 
 # Quickstart: Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
 
 [!INCLUDE [preview disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]
 
-
 To handle interactive Azure Machine Learning notebook data wrangling, Azure Machine Learning integration, with Azure Synapse Analytics (preview), provides easy access to the Apache Spark framework. This access allows for Azure Machine Learning Notebook interactive data wrangling.
 
 In this quickstart guide, you'll learn how to perform interactive data wrangling using Azure Machine Learning Managed (Automatic) Synapse Spark compute, Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.
@@ -37,7 +36,15 @@ We must ensure that the input and output data paths are accessible, before we st
 
 To assign appropriate roles to the user identity:
 
-1. In the Microsoft Azure portal, navigate to the Azure Data Lake Storage (ADLS) Gen 2 storage account page
+1. Open the [Microsoft Azure portal](https://portal.azure.com).
+1. Search and select the **Storage accounts** service.
+
+   :::image type="content" source="media/quickstart-spark-data-wrangling/find-storage-accounts-service.png" lightbox="media/quickstart-spark-data-wrangling/find-storage-accounts-service.png" alt-text="Expandable screenshot showing Storage accounts service search and selection, in Microsoft Azure portal.":::
+
+1. On the **Storage accounts** page, select the Azure Data Lake Storage (ADLS) Gen 2 storage account from the list. A page showing the storage account **Overview** will open.
+
+   :::image type="content" source="media/quickstart-spark-data-wrangling/storage-accounts-list.png" lightbox="media/quickstart-spark-data-wrangling/storage-accounts-list.png" alt-text="Expandable screenshot showing selection of the Azure Data Lake Storage (ADLS) Gen 2 storage account Storage account.":::
+
 1. Select **Access Control (IAM)** from the left panel
 1. Select **Add role assignment**

@@ -73,9 +80,9 @@ A Managed (Automatic) Spark compute is available in Azure Machine Learning Noteb
 ## Interactive data wrangling with Titanic data
 
 > [!TIP]
-> Data wrangling with a Managed (Automatic) Spark compute, and user identity passthrough for data access in a Azure Data Lake Storage (ADLS) Gen 2 storage account, both require the lowest number of configuration steps.
+> Data wrangling with a Managed (Automatic) Spark compute, and user identity passthrough for data access in an Azure Data Lake Storage (ADLS) Gen 2 storage account, both require the lowest number of configuration steps.
 
-The data wrangling code shown here uses the `titanic.csv` file, available [here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account. This Python code snippet shows interactive data wrangling with an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and an input/output data URI, in format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`:
+The data wrangling code shown here uses the `titanic.csv` file, available [here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/spark/data/titanic.csv). Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account. This Python code snippet shows interactive data wrangling with an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and an input/output data URI, in the `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>` format. Here, `<FILE_SYSTEM_NAME>` matches the container name.
 
 ```python
 import pyspark.pandas as pd
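# --- Editorial illustration, not part of this commit: the committed snippet is
# --- elided by the diff here. A hypothetical end-to-end read/wrangle/write round
# --- trip with pyspark.pandas; "mycontainer" and "mystorageaccount" are assumed
# --- placeholder names, where the container supplies <FILE_SYSTEM_NAME>.
df = pd.read_csv(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
# Example wrangling step: drop rows with missing values.
df = df.dropna()
# Write the wrangled data back to a folder in the same container.
df.to_csv(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/wrangled/",
    index_col="PassengerId",
)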
@@ -105,6 +112,7 @@ df.to_csv(
 
 ## Next steps
 - [Apache Spark in Azure Machine Learning (preview)](./apache-spark-azure-ml-concepts.md)
+- [Quickstart: Submit Apache Spark jobs in Azure Machine Learning (preview)](./quickstart-spark-jobs.md)
 - [Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](./how-to-manage-synapse-spark-pool.md)
 - [Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)](./interactive-data-wrangling-with-apache-spark-azure-ml.md)
 - [Submit Spark jobs in Azure Machine Learning (preview)](./how-to-submit-spark-jobs.md)
