Commit 5aa495f

Merge pull request #223230 from fbsolo-ms1/tutorial-for-SK

Yogi P requested a file update . . .

2 parents b8f4de9 + d05f2b9 commit 5aa495f

1 file changed: articles/machine-learning/quickstart-spark-jobs.md (+67 additions, −66 deletions)
@@ -22,16 +22,6 @@ In this quickstart guide, you'll learn how to submit a Spark job using Azure Mac
 
 ## Prerequisites
 
-# [Studio UI](#tab/studio-ui)
-- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
-- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
-- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
-- To enable this feature:
-  1. Navigate to Azure Machine Learning studio UI.
-  2. Select **Manage preview features** (megaphone icon) among the icons on the top right side of the screen.
-  3. In **Managed preview feature** panel, toggle on **Run notebooks and jobs on managed Spark** feature.
-  :::image type="content" source="media/quickstart-spark-jobs/how-to-enable-managed-spark-preview.png" lightbox="media/quickstart-spark-jobs/how-to-enable-managed-spark-preview.png" alt-text="Expandable screenshot showing option for enabling Managed Spark preview.":::
-
 # [CLI](#tab/cli)
 [!INCLUDE [cli v2](../../includes/machine-learning-cli-v2.md)]
 - An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
@@ -60,6 +50,16 @@ In this quickstart guide, you'll learn how to submit a Spark job using Azure Mac
 > - [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
 > - your local computer that has [the Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/installv2) installed.
 
+# [Studio UI](#tab/studio-ui)
+- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
+- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
+- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
+- To enable this feature:
+  1. Navigate to Azure Machine Learning studio UI.
+  2. Select **Manage preview features** (megaphone icon) among the icons on the top right side of the screen.
+  3. In **Managed preview feature** panel, toggle on **Run notebooks and jobs on managed Spark** feature.
+  :::image type="content" source="media/quickstart-spark-jobs/how-to-enable-managed-spark-preview.png" lightbox="media/quickstart-spark-jobs/how-to-enable-managed-spark-preview.png" alt-text="Expandable screenshot showing option for enabling Managed Spark preview.":::
+
 ---
 
 ## Add role assignments in Azure storage accounts
@@ -132,62 +132,6 @@ The above script takes two arguments `--titanic_data` and `--wrangled_data`, whi
 
 ## Submit a standalone Spark job
 
-# [Studio UI](#tab/studio-ui)
-First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for workspace default datastore `workspaceblobstore`. To submit a standalone Spark job using the Azure Machine Learning studio UI:
-
-:::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job.png" alt-text="Expandable screenshot showing creation of a new Spark job in Azure Machine Learning studio UI.":::
-
-1. In the left pane, select **+ New**.
-2. Select **Spark job (preview)**.
-3. On the **Compute** screen:
-
-   :::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" alt-text="Expandable screenshot showing compute selection screen for a new Spark job in Azure Machine Learning studio UI.":::
-
-   1. Under **Select compute type**, select **Spark automatic compute (Preview)** for Managed (Automatic) Spark compute.
-   2. Select **Virtual machine size**. The following instance types are currently supported:
-      - `Standard_E4s_v3`
-      - `Standard_E8s_v3`
-      - `Standard_E16s_v3`
-      - `Standard_E32s_v3`
-      - `Standard_E64s_v3`
-   3. Select **Spark runtime version** as **Spark 3.2**.
-   4. Select **Next**.
-4. On the **Environment** screen, select **Next**.
-5. On the **Job settings** screen:
-   1. Provide a job **Name**, or use the job **Name** generated by default.
-   2. Select an **Experiment name** from the dropdown menu.
-   3. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
-   4. Under the **Code** section:
-      1. Select **Azure Machine Learning workspace default blob storage** from the **Choose code location** dropdown.
-      2. Under **Path to code file to upload**, select **Browse**.
-      3. In the pop-up screen titled **Path selection**, select the path of code file `titanic.py` on the workspace default datastore `workspaceblobstore`.
-      4. Select **Save**.
-   5. Input `titanic.py` as the name of **Entry file** for the standalone job.
-   6. To add an input, select **+ Add input** under **Inputs** and:
-      1. Enter **Input name** as `titanic_data`. You'll refer to this name later in **Arguments**.
-      2. Select **Input type** as **Data**.
-      3. Select **Data type** as **File**.
-      4. Select **Data source** as **URI**.
-      5. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for the `titanic.csv` file in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
-   7. To add an output, select **+ Add output** under **Outputs** and:
-      1. Enter **Output name** as `wrangled_data`. You'll refer to this name later in **Arguments**.
-      2. Select **Output type** as **Folder**.
-      3. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
-   8. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
-6. Under the **Spark configurations** section:
-   1. For **Executor size**:
-      1. Enter the number of executor **Cores** as 2 and executor **Memory (GB)** as 2.
-      2. For **Dynamically allocated executors**, select **Disabled**.
-      3. Enter the number of **Executor instances** as 2.
-   2. For **Driver size**, enter the number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
-7. Select **Next**.
-8. On the **Review** screen:
-   1. Review the job specification before submitting it.
-   2. Select **Create** to submit the standalone Spark job.
-
-> [!NOTE]
-> A standalone job submitted from the Studio UI using an Azure Machine Learning Managed (Automatic) Spark compute defaults to user identity passthrough for data access.
-
 # [CLI](#tab/cli)
 [!INCLUDE [cli v2](../../includes/machine-learning-cli-v2.md)]
 This example YAML specification shows a standalone Spark job. It uses an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and an input/output data URI in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`:
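The hunk header above notes that `titanic.py` takes two arguments, `--titanic_data` and `--wrangled_data`, which the job wires to its input and output URIs. The commit doesn't include the script body, so the following is only a minimal sketch of that argument interface; the parser details and sample URIs are assumptions, not code from this change:

```python
import argparse


def parse_args(argv=None):
    # titanic.py is documented to take two arguments carrying the input
    # data URI and the output folder URI (hypothetical parser sketch).
    parser = argparse.ArgumentParser(description="Wrangle Titanic data")
    parser.add_argument("--titanic_data", required=True,
                        help="abfss:// URI of the input titanic.csv file")
    parser.add_argument("--wrangled_data", required=True,
                        help="abfss:// URI of the output folder")
    return parser.parse_args(argv)


# Mirrors the documented Arguments string:
# --titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}
args = parse_args([
    "--titanic_data",
    "abfss://myfs@myaccount.dfs.core.windows.net/data/titanic.csv",
    "--wrangled_data",
    "abfss://myfs@myaccount.dfs.core.windows.net/data/wrangled",
])
print(args.titanic_data)
```

At job runtime, Azure Machine Learning substitutes the `${{inputs.*}}` and `${{outputs.*}}` placeholders with the resolved URIs before they reach the script.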
@@ -308,6 +252,63 @@ In the above code sample:
 - `Standard_E32S_V3`
 - `Standard_E64S_V3`
 
+# [Studio UI](#tab/studio-ui)
+First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for workspace default datastore `workspaceblobstore`. To submit a standalone Spark job using the Azure Machine Learning studio UI:
+
+:::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job.png" alt-text="Expandable screenshot showing creation of a new Spark job in Azure Machine Learning studio UI.":::
+
+1. In the left pane, select **+ New**.
+2. Select **Spark job (preview)**.
+3. On the **Compute** screen:
+
+   :::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" alt-text="Expandable screenshot showing compute selection screen for a new Spark job in Azure Machine Learning studio UI.":::
+
+   1. Under **Select compute type**, select **Spark automatic compute (Preview)** for Managed (Automatic) Spark compute.
+   2. Select **Virtual machine size**. The following instance types are currently supported:
+      - `Standard_E4s_v3`
+      - `Standard_E8s_v3`
+      - `Standard_E16s_v3`
+      - `Standard_E32s_v3`
+      - `Standard_E64s_v3`
+   3. Select **Spark runtime version** as **Spark 3.2**.
+   4. Select **Next**.
+4. On the **Environment** screen, select **Next**.
+5. On the **Job settings** screen:
+   1. Provide a job **Name**, or use the job **Name** generated by default.
+   2. Select an **Experiment name** from the dropdown menu.
+   3. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
+   4. Under the **Code** section:
+      1. Select **Azure Machine Learning workspace default blob storage** from the **Choose code location** dropdown.
+      2. Under **Path to code file to upload**, select **Browse**.
+      3. In the pop-up screen titled **Path selection**, select the path of code file `titanic.py` on the workspace default datastore `workspaceblobstore`.
+      4. Select **Save**.
+   5. Input `titanic.py` as the name of **Entry file** for the standalone job.
+   6. To add an input, select **+ Add input** under **Inputs** and:
+      1. Enter **Input name** as `titanic_data`. You'll refer to this name later in **Arguments**.
+      2. Select **Input type** as **Data**.
+      3. Select **Data type** as **File**.
+      4. Select **Data source** as **URI**.
+      5. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for the `titanic.csv` file in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
+   7. To add an output, select **+ Add output** under **Outputs** and:
+      1. Enter **Output name** as `wrangled_data`. You'll refer to this name later in **Arguments**.
+      2. Select **Output type** as **Folder**.
+      3. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
+   8. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
+6. Under the **Spark configurations** section:
+   1. For **Executor size**:
+      1. Enter the number of executor **Cores** as 2 and executor **Memory (GB)** as 2.
+      2. For **Dynamically allocated executors**, select **Disabled**.
+      3. Enter the number of **Executor instances** as 2.
+   2. For **Driver size**, enter the number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
+7. Select **Next**.
+8. On the **Review** screen:
+   1. Review the job specification before submitting it.
+   2. Select **Create** to submit the standalone Spark job.
+
+> [!NOTE]
+> A standalone job submitted from the Studio UI using an Azure Machine Learning Managed (Automatic) Spark compute defaults to user identity passthrough for data access.
+
 ---
 
 > [!TIP]

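Both the input and output steps in this commit take ADLS Gen 2 URIs in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`. As a quick sanity check on that shape, here is a small sketch that assembles such a URI; the helper name and sample values are hypothetical, not part of the article:

```python
def abfss_uri(file_system: str, storage_account: str, path: str) -> str:
    # Assembles an ADLS Gen 2 URI in the documented format:
    # abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>
    # A leading slash on the path is stripped so the URI has exactly one
    # separator between the host and the data path.
    return (f"abfss://{file_system}@{storage_account}"
            f".dfs.core.windows.net/{path.lstrip('/')}")


print(abfss_uri("myfilesystem", "mystorageaccount", "data/titanic.csv"))
# → abfss://myfilesystem@mystorageaccount.dfs.core.windows.net/data/titanic.csv
```

Here `<FILE_SYSTEM_NAME>` is the storage container (file system) name and `<STORAGE_ACCOUNT_NAME>` is the ADLS Gen 2 account that the role assignments in the article grant access to.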