`articles/machine-learning/quickstart-spark-jobs.md`
## Prerequisites
- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
> - [Visual Studio Code connected to an Azure Machine Learning compute instance](./how-to-set-up-vs-code-remote.md?tabs=studio).
> - your local computer that has [the Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/installv2) installed.

# [Studio UI](#tab/studio-ui)

- An Azure subscription; if you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free) before you begin.
- An Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md).
- An Azure Data Lake Storage (ADLS) Gen 2 storage account. See [Create an Azure Data Lake Storage (ADLS) Gen 2 storage account](../storage/blobs/create-data-lake-storage-account.md).
- To enable this feature:
  1. Navigate to the Azure Machine Learning studio UI.
  2. Select **Manage preview features** (megaphone icon) among the icons on the top right side of the screen.
  3. In the **Managed preview feature** panel, toggle on the **Run notebooks and jobs on managed Spark** feature.
The above script takes two arguments, `--titanic_data` and `--wrangled_data`.
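The wrangling script reads its input and output locations from those two command-line arguments. A plausible, hypothetical shape for that argument handling in `titanic.py` (a sketch only; the actual quickstart script may differ) is:

```python
import argparse

def parse_args(argv=None):
    # Two arguments, matching the names used later in the job Arguments string:
    # --titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}
    parser = argparse.ArgumentParser(description="Titanic data wrangling job")
    parser.add_argument("--titanic_data", help="path or URI of the input titanic.csv")
    parser.add_argument("--wrangled_data", help="folder path or URI for the wrangled output")
    return parser.parse_args(argv)

args = parse_args(["--titanic_data", "in/titanic.csv", "--wrangled_data", "out/"])
```

At run time the service substitutes the resolved input and output paths for the `${{...}}` placeholders, so the script only ever sees plain path strings.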
## Submit a standalone Spark job
This example YAML specification shows a standalone Spark job. It uses an Azure Machine Learning Managed (Automatic) Spark compute, user identity passthrough, and input/output data URIs in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`:
- `Standard_E32S_V3`
- `Standard_E64S_V3`
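Both tabs reference ADLS Gen 2 URIs of the form `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`. A small helper (hypothetical, stdlib only) shows how those pieces compose:

```python
def abfss_uri(file_system: str, storage_account: str, path: str) -> str:
    """Compose an ADLS Gen 2 URI:
    abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>
    """
    # Strip any leading slash so the path joins cleanly after the host.
    return (f"abfss://{file_system}@{storage_account}"
            f".dfs.core.windows.net/{path.lstrip('/')}")

# Example with made-up names: the container, account, and path are placeholders.
uri = abfss_uri("data", "mystorageaccount", "titanic/titanic.csv")
```

The same format works for both the file input (`titanic.csv`) and the folder output (`wrangled_data`); only the trailing path differs.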
# [Studio UI](#tab/studio-ui)

First, upload the parameterized Python code `titanic.py` to the Azure Blob storage container for the workspace default datastore `workspaceblobstore`. To submit a standalone Spark job using the Azure Machine Learning studio UI:

:::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job.png" alt-text="Expandable screenshot showing creation of a new Spark job in Azure Machine Learning studio UI.":::

1. In the left pane, select **+ New**.
2. Select **Spark job (preview)**.
3. On the **Compute** screen:

   :::image type="content" source="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" lightbox="media/quickstart-spark-jobs/create-standalone-spark-job-compute.png" alt-text="Expandable screenshot showing compute selection screen for a new Spark job in Azure Machine Learning studio UI.":::

   1. Under **Select compute type**, select **Spark automatic compute (Preview)** for Managed (Automatic) Spark compute.
   2. Select **Virtual machine size**. The following instance types are currently supported:
      - `Standard_E4s_v3`
      - `Standard_E8s_v3`
      - `Standard_E16s_v3`
      - `Standard_E32s_v3`
      - `Standard_E64s_v3`
   3. Select **Spark runtime version** as **Spark 3.2**.
   4. Select **Next**.
4. On the **Environment** screen, select **Next**.
5. On the **Job settings** screen:
   1. Provide a job **Name**, or use the job **Name** generated by default.
   2. Select an **Experiment name** from the dropdown menu.
   3. Under **Add tags**, provide **Name** and **Value**, then select **Add**. Adding tags is optional.
   4. Under **Path to code file to upload**, select **Browse**.
   5. In the pop-up screen titled **Path selection**, select the path of the code file `titanic.py` on the workspace default datastore `workspaceblobstore`.
   6. Select **Save**.
   7. Input `titanic.py` as the name of the **Entry file** for the standalone job.
   8. To add an input, select **+ Add input** under **Inputs** and:
      1. Enter **Input name** as `titanic_data`. This name is referenced later in the **Arguments**.
      2. Select **Input type** as **Data**.
      3. Select **Data type** as **File**.
      4. Select **Data source** as **URI**.
      5. Enter an Azure Data Lake Storage (ADLS) Gen 2 data URI for the `titanic.csv` file in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
   9. To add an output, select **+ Add output** under **Outputs** and:
      1. Enter **Output name** as `wrangled_data`. This name is referenced later in the **Arguments**.
      2. Select **Output type** as **Folder**.
      3. For **Output URI destination**, enter an Azure Data Lake Storage (ADLS) Gen 2 folder URI in the format `abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>`.
   10. Enter **Arguments** as `--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}`.
   11. Under the **Spark configurations** section:
       1. For **Executor size**:
          1. Enter the number of executor **Cores** as 2 and executor **Memory (GB)** as 2.
          2. For **Dynamically allocated executors**, select **Disabled**.
          3. Enter the number of **Executor instances** as 2.
       2. For **Driver size**, enter the number of driver **Cores** as 1 and driver **Memory (GB)** as 2.
   12. Select **Next**.
6. On the **Review** screen:
   1. Review the job specification before submitting it.
   2. Select **Create** to submit the standalone Spark job.

> [!NOTE]
> A standalone job submitted from the Studio UI using an Azure Machine Learning Managed (Automatic) Spark compute defaults to user identity passthrough for data access.
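Taken together, the UI selections above amount to a job specification. As a rough illustration (plain Python, not an Azure Machine Learning SDK call; the dict keys mirror the standalone Spark job YAML fields and the placeholder URIs are assumptions), the same settings can be written down as:

```python
# Sketch: the Studio UI choices above, captured as a dict that mirrors
# the standalone Spark job YAML fields (key names are illustrative).
job_spec = {
    "type": "spark",
    "entry": {"file": "titanic.py"},  # entry file named in the UI
    "conf": {
        # Spark configurations section: driver 1 core / 2 GB; executors
        # 2 cores / 2 GB each, dynamic allocation disabled, 2 instances.
        "spark.driver.cores": 1,
        "spark.driver.memory": "2g",
        "spark.executor.cores": 2,
        "spark.executor.memory": "2g",
        "spark.executor.instances": 2,
    },
    "inputs": {
        "titanic_data": {
            "type": "uri_file",
            "path": "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>"
                    ".dfs.core.windows.net/<PATH_TO_DATA>/titanic.csv",
        },
    },
    "outputs": {
        "wrangled_data": {
            "type": "uri_folder",
            "path": "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>"
                    ".dfs.core.windows.net/<PATH_TO_DATA>/wrangled/",
        },
    },
    "args": "--titanic_data ${{inputs.titanic_data}} "
            "--wrangled_data ${{outputs.wrangled_data}}",
    "resources": {"instance_type": "Standard_E8s_v3", "runtime_version": "3.2"},
}
```

The `${{inputs.titanic_data}}` and `${{outputs.wrangled_data}}` placeholders in `args` are resolved by the service at run time, which is why the **Arguments** string must use exactly the names chosen for the input and output.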