---
title: Interact with your jobs (debug and monitor)
titleSuffix: Azure Machine Learning
description: Debug or monitor your Machine Learning job as it runs on AzureML compute with your training application of choice.
services: machine-learning
ms.date: 03/15/2022
---

# Debug jobs and monitor training progress
Machine learning model training is usually an iterative process and requires significant experimentation. With the Azure Machine Learning interactive job experience, data scientists can use the Azure Machine Learning Python SDKv2, the Azure Machine Learning CLIv2, or Azure Machine Learning studio to access the container where their job is running. Once the job container is accessed, users can iterate on training scripts, monitor training progress, or debug the job remotely as they typically do on their local machines. You can interact with jobs through different training applications, including **JupyterLab, TensorBoard, VS Code**, or by connecting to the job container directly via **SSH**.
Interactive training is supported on **Azure Machine Learning compute clusters** and **Azure Arc-enabled Kubernetes clusters**.
22
22
23
-
## Pre-requisites
24
-
- Review [getting started with training on AzureML](./how-to-train-model.md).
23
+
## Prerequisites
- Review [getting started with training on Azure Machine Learning](./how-to-train-model.md).
- To use **VS Code**, [follow this guide](how-to-setup-vs-code.md) to set up the Azure Machine Learning extension.
- Make sure your job environment has the `openssh-server` and `ipykernel ~=6.0` packages installed (all Azure Machine Learning curated training environments have these packages installed by default). If you build a custom environment, see the install sketch after this list.
- Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setups (configuring multi-node training without using the above distribution frameworks) aren't currently supported.
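
If you build a custom training environment, the prerequisite packages above need to be part of its image. The following is a minimal sketch, assuming a Debian-based image in which you can run shell commands during the image build; the base image and version pins are assumptions, not requirements from this article.

```bash
# Sketch only: install the packages the interactive applications rely on.
# Assumes a Debian/Ubuntu-based image built with root privileges.
apt-get update && apt-get install -y openssh-server
pip install "ipykernel~=6.0"
```
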
By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.
### Enable during job submission
# [Azure Machine Learning Studio](#tab/ui)
1. Create a new job from the left navigation pane in the studio portal.
2. Choose `Compute cluster` or `Attached compute` (Kubernetes) as the compute type, choose the compute target, and specify how many nodes you need in `Instance count`.
:::image type="content" source="./media/interactive-jobs/sleep-command.png" alt-text="Screenshot of reviewing a drafted job and completing the creation.":::
You can put `sleep <specific time>` at the end of your command to specify the amount of time you want to reserve the compute resource. The format follows (a full command sketch appears after this list):
* sleep 1s
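
For example, a command that runs training and then keeps the node reserved for two hours could look like the following sketch; the script name and duration are illustrative placeholders, not values from this article.

```bash
# Sketch only: run training, then hold the compute for interactive use.
python train.py && sleep 2h
```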
5. Select the training applications you want to use to interact with the job.
:::image type="content" source="./media/interactive-jobs/select-training-apps.png" alt-text="Screenshot of selecting a training application for the user to use for a job.":::
6. Review and create the job.
---
### Connect to endpoints
# [Azure Machine Learning Studio](#tab/ui)
To interact with your running job, select the **Debug and monitor** button on the job details page.
:::image type="content" source="media/interactive-jobs/debug-and-monitor.png" alt-text="Screenshot of interactive jobs debug and monitor panel location.":::
Selecting an application in the panel opens a new tab for it. You can access the applications only when they are in **Running** status, and only the **job owner** is authorized to access them. If you're training on multiple nodes, you can pick the specific node you would like to interact with.
:::image type="content" source="media/interactive-jobs/interactive-jobs-right-panel.png" alt-text="Screenshot of interactive jobs right panel information. Information content will vary depending on the user's data.":::
It might take a few minutes to start the job and the training applications specified during job creation.
# [Python SDK](#tab/python)
- Once the job is submitted, you can use `ml_client.jobs.show_services("<job name>", <compute node index>)` to view the interactive service endpoints.
- To connect via SSH to the container where the job is running, run the command `az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>`. To set up the Azure Machine Learning CLIv2, follow this [guide](./how-to-configure-cli.md). A filled-in example of the command follows this list.
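
As a rough illustration, a filled-in invocation might look like the following sketch; the job name, node index, and key path are made-up example values, not values from this article.

```bash
# Sketch only: all values are illustrative.
az ml job connect-ssh \
  --name sample_interactive_job \
  --node-index 0 \
  --private-key-file-path ~/.ssh/id_rsa
```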
You can find the reference documentation for the SDKv2 [here](/sdk/azure/ml).
### Interact with the applications
When you click on the endpoints to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs under **system_logs->interactive_capability** on the **Outputs + logs** tab.
:::image type="content" source="./media/interactive-jobs/interactive-jobs-logs.png" alt-text="Screenshot of interactive jobs interactive logs panel location.":::
- You can open a terminal from JupyterLab and start interacting within the job container. You can also directly iterate on your training script with JupyterLab.
:::image type="content" source="./media/interactive-jobs/jupyter-lab.png" alt-text="Screenshot of interactive jobs JupyterLab content panel.":::
- You can also interact with the job container within VS Code. To attach a debugger to a job during job submission and pause execution, [navigate here](./how-to-interactive-jobs.md#attach-a-debugger-to-a-job).
:::image type="content" source="./media/interactive-jobs/vs-code-open.png" alt-text="Screenshot of interactive jobs VS Code panel when first opened. This shows the sample Python file that was created to print two lines.":::
- If you have logged TensorFlow events for your job, you can use TensorBoard to monitor the metrics while your job is running.
:::image type="content" source="./media/interactive-jobs/tensorboard-open.png" alt-text="Screenshot of interactive jobs TensorBoard panel when first opened. This information will vary depending upon customer data.":::
### End job
Once you're done with the interactive training, you can go to the job details page to cancel the job, which releases the compute resource. Alternatively, use `az ml job cancel -n <your job name>` in the CLI or `ml_client.jobs.cancel("<job name>")` in the SDK.
:::image type="content" source="./media/interactive-jobs/cancel-job.png" alt-text="Screenshot of interactive jobs cancel job option and its location for user selection.":::
## Attach a debugger to a job
To submit a job with a debugger attached and the execution paused, you can use debugpy and VS Code (`debugpy` must be installed in your job environment).
1. During job submission (either through the UI, the CLIv2, or the SDKv2), use the debugpy command to run your Python script. For example, the following screenshot shows a sample command that uses debugpy to attach the debugger for a TensorFlow script (`tfevents.py` can be replaced with the name of your training script); a command-line sketch also appears after these steps.
:::image type="content" source="./media/interactive-jobs/open-debugger.png" alt-text="Screenshot of interactive jobs location of open debugger on the left side panel.":::
3. Use the "Remote Attach" debug configuration to attach to the submitted job and pass in the path and port you configured in your job submission command. You can also find this information on the job details page.
:::image type="content" source="./media/interactive-jobs/set-breakpoints.png" alt-text="Screenshot of location of an example breakpoint that is set in the Visual Studio Code editor.":::
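
As a rough sketch of the kind of debugpy invocation step 1 describes, you might launch your script as follows; the port is an assumed placeholder choice, and the flags are standard debugpy options rather than anything specific to this article.

```bash
# Sketch only: start the training script under debugpy and pause until a
# debugger (for example, VS Code "Remote Attach") connects on port 5678.
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client tfevents.py
```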