Commit 4688434

PR edits
1 parent 5177d0e commit 4688434

File tree

4 files changed: +35 -30 lines changed


articles/machine-learning/how-to-interactive-jobs.md

Lines changed: 32 additions & 30 deletions
@@ -1,5 +1,5 @@
 ---
-title: Interact with your jobs (debug & monitor)
+title: Interact with your jobs (debug and monitor)
 titleSuffix: Azure Machine Learning
 description: Debug or monitor your Machine Learning job as it runs on AzureML compute with your training application of choice.
 services: machine-learning
@@ -16,39 +16,40 @@ ms.date: 03/15/2022
 
 # Debug jobs and monitor training progress
 
-Machine learning model training is usually an iterative process and requires significant experimentation. With the AzureML interactive job experience, data scientists can use the AzureML Python SDKv2, AzureML CLIv2 or AzureML Studio to access the container where their job is running. Once the job container is accessed, users can iterate on training scripts, monitor training progress or debug the job remotely like they typically do on their local machines. Jobs can be interacted with via different training applications including **JupyterLab, TensorBoard, VS Code** or by connecting to the job container directly via **SSH**.
+Machine learning model training is usually an iterative process and requires significant experimentation. With the Azure Machine Learning interactive job experience, data scientists can use the Azure Machine Learning Python SDKv2, the Azure Machine Learning CLIv2, or Azure Machine Learning studio to access the container where their job is running. Once the job container is accessed, users can iterate on training scripts, monitor training progress, or debug the job remotely as they typically do on their local machines. You can interact with a job through different training applications, including **JupyterLab, TensorBoard, and VS Code**, or by connecting to the job container directly via **SSH**.
 
-Interactive training is supported on **AzureML Compute Cluster** and **Azure Arc-enabled Kubernetes Cluster**.
+Interactive training is supported on **Azure Machine Learning compute clusters** and **Azure Arc-enabled Kubernetes clusters**.
 
-## Pre-requisites
-- Review [getting started with training on AzureML](./how-to-train-model.md).
+## Prerequisites
+- Review [getting started with training on Azure Machine Learning](./how-to-train-model.md).
 - To use **VS Code**, [follow this guide](how-to-setup-vs-code.md) to set up the Azure Machine Learning extension.
-- Make sure your job environment has the `openssh-server` and `ipykernel ~=6.0` packages installed (all AzureML curated training environments have these packages installed by default).
-- Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than Pytorch, Tensorflow or MPI. Custom distributed training setup (configuring multi-node training without using the above distribution frameworks) is not currently supported.
+- Make sure your job environment has the `openssh-server` and `ipykernel ~=6.0` packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
+- Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setup (configuring multi-node training without using the above distribution frameworks) isn't currently supported.
 
-![screenshot supported-distribution-types](media/interactive-jobs/supported-distribution-types.png)
+The supported distribution types are:
+* PyTorch
+* MPI
+* TensorFlow
 
 ## Interact with your job container
 
 By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.
 
 ### Enable during job submission
-# [AzureML Studio](#tab/ui)
+# [Azure Machine Learning studio](#tab/ui)
 1. Create a new job from the left navigation pane in the studio portal.
 
-   ![screenshot select-job-ui](media/interactive-jobs/create-job.png)
 
 2. Choose `Compute cluster` or `Attached compute` (Kubernetes) as the compute type, choose the compute target, and specify how many nodes you need in `Instance count`.
 
-   ![screenshot select-compute-ui](media/interactive-jobs/select-compute.png)
+   :::image type="content" source="./media/interactive-jobs/select-compute.png" alt-text="Screenshot of selecting a compute location for a job.":::
 
 3. Follow the wizard to choose the environment you want for the job.
 
-   ![screenshot select-environment-ui](media/interactive-jobs/select-environment.png)
 
 4. In the `Job settings` step, add your training code (and input/output data) and reference it in your command to make sure it's mounted to your job.
 
-   ![screenshot set-command](media/interactive-jobs/sleep-command.png)
+   :::image type="content" source="./media/interactive-jobs/sleep-command.png" alt-text="Screenshot of reviewing a drafted job and completing the creation.":::
 
    You can put `sleep <specific time>` at the end of your command to specify how long you want to reserve the compute resource. The format follows:
    * sleep 1s
@@ -63,7 +64,7 @@ By specifying interactive applications at job creation, you can connect directly
 
 5. Select the training applications you want to use to interact with the job.
 
-   ![screenshot select-apps](media/interactive-jobs/select-training-apps.png)
+   :::image type="content" source="./media/interactive-jobs/select-training-apps.png" alt-text="Screenshot of selecting a training application for the user to use for a job.":::
 
 6. Review and create the job.

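The studio steps above have an SDKv2 counterpart. The following is a rough sketch only, not a sample from the article under edit: every resource name is a placeholder, and it assumes the `azure-ai-ml` package, whose `command` builder and job-service entities (`JupyterLabJobService`, `VsCodeJobService`, `TensorBoardJobService`, `SshJobService`) back the interactive-job experience described here.

```python
# A minimal sketch: submit a command job with interactive applications enabled
# and a trailing `sleep` to hold the compute. All names below are placeholders.
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import (
    JupyterLabJobService,
    SshJobService,
    TensorBoardJobService,
    VsCodeJobService,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

job = command(
    code="./src",  # assumed local folder containing train.py
    # `sleep 1h` keeps the node reserved for an hour after training finishes.
    command="python train.py && sleep 1h",
    environment="<curated-or-custom-environment>@latest",
    compute="<compute-cluster-name>",
    services={
        "my_jupyterlab": JupyterLabJobService(),
        "my_vscode": VsCodeJobService(),
        "my_tensorboard": TensorBoardJobService(log_dir="outputs/tblogs"),
        "my_ssh": SshJobService(ssh_public_keys="<ssh-public-key>"),
    },
)

returned_job = ml_client.jobs.create_or_update(job)
```

With `sleep 1h` appended, the node stays reserved for an hour after `train.py` exits, which is what the studio tip above accomplishes with `sleep <specific time>`.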
@@ -171,21 +172,22 @@ By specifying interactive applications at job creation, you can connect directly
 ---
 
 ### Connect to endpoints
-# [AzureML Studio](#tab/ui)
+# [Azure Machine Learning studio](#tab/ui)
 To interact with your running job, select the **Debug and monitor** button on the job details page.
 
-![screenshot debug-and-monitor](media/interactive-jobs/debug-and-monitor.png)
+:::image type="content" source="media/interactive-jobs/debug-and-monitor.png" alt-text="Screenshot of interactive jobs debug and monitor panel location.":::
+
 
 Clicking an application in the panel opens a new tab for it. You can access the applications only when they're in **Running** status, and only the **job owner** is authorized to access them. If you're training on multiple nodes, you can pick the specific node you'd like to interact with.
 
-![screenshot ij-right-panel](media/interactive-jobs/ij-right-panel.png)
+:::image type="content" source="media/interactive-jobs/interactive-jobs-right-panel.png" alt-text="Screenshot of interactive jobs right panel information. Content varies depending on the user's data.":::
 
 It might take a few minutes to start the job and the training applications specified during job creation.
 
 # [Python SDK](#tab/python)
 - Once the job is submitted, you can use `ml_client.jobs.show_services("<job name>", <compute node index>)` to view the interactive service endpoints.
 
-- To connect via SSH to the container where the job is running, run the command `az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>`. To set up the AzureML CLIv2, follow this [guide](./how-to-configure-cli.md).
+- To connect via SSH to the container where the job is running, run the command `az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>`. To set up the Azure Machine Learning CLIv2, follow this [guide](./how-to-configure-cli.md).
 
 You can find the reference documentation for the SDKv2 [here](/sdk/azure/ml).

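To complement the SDK tab, here's a minimal, hedged sketch of listing the endpoints for a running job. It reuses the authenticated `ml_client` from the earlier sketch; the exact attributes on the returned service objects can vary by SDK version.

```python
# List the interactive service endpoints for node 0 of a running job.
# "<job name>" is the placeholder used in the text above.
services = ml_client.jobs.show_services("<job name>", node_index=0)
for name, service in services.items():
    # Inspect the returned service objects; attribute names may vary by version.
    print(name, vars(service))
```

The SSH endpoint itself is reached with the `az ml job connect-ssh` command shown in the bullet above.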
@@ -205,45 +207,45 @@ You can access the applications only when they are in **Running** status and onl
 ### Interact with the applications
 When you click on the endpoints to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs in **system_logs->interactive_capability** under the **Outputs + logs** tab.
 
-![screenshot check-logs](./media/interactive-jobs/ij-logs.png)
+:::image type="content" source="./media/interactive-jobs/interactive-jobs-logs.png" alt-text="Screenshot of interactive jobs interactive logs panel location.":::
 
 - You can open a terminal from Jupyter Lab and start interacting within the job container. You can also directly iterate on your training script with Jupyter Lab.
 
-  ![screenshot jupyter-lab](./media/interactive-jobs/jupyter-lab.png)
+  :::image type="content" source="./media/interactive-jobs/jupyter-lab.png" alt-text="Screenshot of interactive jobs Jupyter lab content panel.":::
 
 - You can also interact with the job container within VS Code. To attach a debugger to a job during job submission and pause execution, [navigate here](./how-to-interactive-jobs.md#attach-a-debugger-to-a-job).
 
-  ![screenshot vs-code-open](./media/interactive-jobs/vs-code-open.png)
+  :::image type="content" source="./media/interactive-jobs/vs-code-open.png" alt-text="Screenshot of interactive jobs VS Code panel when first opened. This shows the sample Python file that was created to print two lines.":::
 
 - If you have logged TensorFlow events for your job, you can use TensorBoard to monitor the metrics while your job is running.
 
-  ![screenshot tensorboard-open](./media/interactive-jobs/tensorboard-open.png)
+  :::image type="content" source="./media/interactive-jobs/tensorboard-open.png" alt-text="Screenshot of interactive jobs TensorBoard panel when first opened. This information varies depending on customer data.":::
 
 ### End job
 Once you're done with the interactive training, you can also go to the job details page to cancel the job, which will release the compute resource. Alternatively, use `az ml job cancel -n <your job name>` in the CLI or `ml_client.jobs.cancel("<job name>")` in the SDK.
 
-![screenshot cancel-job](./media/interactive-jobs/cancel-job.png)
+:::image type="content" source="./media/interactive-jobs/cancel-job.png" alt-text="Screenshot of the interactive jobs cancel job option and its location for user selection.":::
 
 ## Attach a debugger to a job
-To submit a job with a debugger attached and the execution paused, you can use debugpy & VS Code (`debugpy` must be installed in your job environment).
+To submit a job with a debugger attached and the execution paused, you can use debugpy and VS Code (`debugpy` must be installed in your job environment).
 
 1. During job submission (either through the UI, the CLIv2, or the SDKv2), use the debugpy command to run your Python script. For example, the below screenshot shows a sample command that uses debugpy to attach the debugger for a TensorFlow script (`tfevents.py` can be replaced with the name of your training script).
 
-   ![screenshot use-debugpy](./media/interactive-jobs/use-debugpy.png)
+   :::image type="content" source="./media/interactive-jobs/use-debugpy.png" alt-text="Screenshot of interactive jobs configuration of debugpy.":::
 
 2. Once the job has been submitted, [connect to VS Code](./how-to-interactive-jobs.md#connect-to-endpoints) and click the built-in debugger.
 
-   ![screenshot open-debugger](./media/interactive-jobs/open-debugger.png)
+   :::image type="content" source="./media/interactive-jobs/open-debugger.png" alt-text="Screenshot of the interactive jobs open debugger location on the left side panel.":::
 
 3. Use the "Remote Attach" debug configuration to attach to the submitted job, and pass in the path and port you configured in your job submission command. You can also find this information on the job details page.
 
-   ![screenshot debug-path-and-port](./media/interactive-jobs/debug-path-and-port.png)
-
-   ![screenshot remote-attach](./media/interactive-jobs/remote-attach.png)
+   :::image type="content" source="./media/interactive-jobs/debug-path-and-port.png" alt-text="Screenshot of the path and port configured for the debugger on the job details page.":::
+
+   :::image type="content" source="./media/interactive-jobs/remote-attach.png" alt-text="Screenshot of the interactive jobs Remote Attach button.":::
 
 4. Set breakpoints and walk through your job execution as you would in your local debugging workflow.
 
-   ![screenshot set-breakpoints](./media/interactive-jobs/set-breakpoints.png)
+   :::image type="content" source="./media/interactive-jobs/set-breakpoints.png" alt-text="Screenshot of an example breakpoint set in the Visual Studio Code editor.":::
 
 
 > [!NOTE]

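The debugpy command in the screenshot isn't reproduced in this diff, so the following sketch only approximates the pattern the step describes, reusing the hedged `command(...)` setup and imports from the earlier example; the port `5678` and the `tfevents.py` script name are placeholders.

```python
# A sketch of the job submission step with debugpy: the job pauses at startup
# until a debug client attaches. Reuses `command`, `VsCodeJobService`, and
# `ml_client` from the earlier hedged example.
job = command(
    code="./src",  # assumed folder containing tfevents.py
    command="python -m debugpy --listen 0.0.0.0:5678 --wait-for-client tfevents.py",
    environment="<curated-or-custom-environment>@latest",
    compute="<compute-cluster-name>",
    services={"my_vscode": VsCodeJobService()},  # VS Code is used for Remote Attach
)
returned_job = ml_client.jobs.create_or_update(job)
```

Once submitted, the "Remote Attach" configuration in step 3 connects to that port.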
articles/machine-learning/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -480,6 +480,9 @@
 - name: Use automated ML with Databricks
   displayName: automl
   href: how-to-configure-databricks-automl-environment.md
+- name: Debug jobs and monitor training progress
+  displayName: automl
+  href: how-to-interactive-jobs.md
 - name: Prep image data for computer vision models (Python)
   displayName: SDK, automl, image, datasets, conversion scripts, schema, image model
   href: how-to-prepare-datasets-for-automl-images.md
