
Commit a14f316

Merge pull request #231769 from joburges/docs-editor/how-to-interactive-jobs-1679517800
Update how-to-interactive-jobs.md
2 parents 4bf4ef5 + ca661eb commit a14f316


articles/machine-learning/how-to-interactive-jobs.md

Lines changed: 82 additions & 94 deletions
@@ -27,18 +27,18 @@ Interactive training is supported on **Azure Machine Learning Compute Clusters**

## Prerequisites
- Review [getting started with training on Azure Machine Learning](./how-to-train-model.md).
- To use this feature in Azure Machine Learning studio, enable the "Debug & monitor your training jobs" flight via the [preview panel](./how-to-enable-preview-features.md#how-do-i-enable-preview-features).
- To use **VS Code**, [follow this guide](how-to-setup-vs-code.md) to set up the Azure Machine Learning extension.
- Make sure your job environment has the `openssh-server` and `ipykernel ~=6.0` packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
- Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setup (configuring multi-node training without using the above distribution frameworks) isn't currently supported.
- To use SSH, you need an SSH key pair. You can use the `ssh-keygen -f "<filepath>"` command to generate a public and private key pair.

## Interact with your job container

By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.

### Enable during job submission
# [Azure Machine Learning studio](#tab/ui)
1. Create a new job from the left navigation pane in the studio portal.

@@ -54,10 +54,10 @@ By specifying interactive applications at job creation, you can connect directly
:::image type="content" source="./media/interactive-jobs/sleep-command.png" alt-text="Screenshot of reviewing a drafted job and completing the creation.":::

You can put `sleep <specific time>` at the end of your command to specify the amount of time you want to reserve the compute resource. The format follows:
* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d

You can also use the `sleep infinity` command to keep the job alive indefinitely.

@@ -77,105 +77,94 @@ If you don't see the above options, make sure you have enabled the "Debug & moni

Note that you have to import the job service classes (for example, `JupyterLabJobService`, `VsCodeJobService`, `TensorBoardJobService`, and `SshJobService`) from the `azure.ai.ml.entities` package to configure interactive services via SDK v2.
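
For reference, a minimal set of imports and client setup for the example below might look like the following. This is only a sketch: it assumes the `azure-ai-ml` and `azure-identity` packages are installed, and the workspace details are placeholders.

```python
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import (
    JupyterLabJobService,
    SshJobService,
    TensorBoardJobService,
    VsCodeJobService,
)
from azure.identity import DefaultAzureCredential

# Connect to your workspace; the subscription, resource group, and workspace names are placeholders.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
```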

```python
command_job = command(...
    code="./src",  # local path where the code is stored
    command="python main.py",  # you can add a command like "sleep 1h" to keep the compute resource reserved after the script finishes running
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="<name-of-compute>",
    services={
        "My_jupyterlab": JupyterLabJobService(
            nodes="all"  # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` isn't set, interactive applications are only enabled on the head node by default. Values are "all" or the compute node index (for example, "0", "1", etc.)
        ),
        "My_vscode": VsCodeJobService(
            nodes="all"
        ),
        "My_tensorboard": TensorBoardJobService(
            nodes="all",
            log_dir="output/tblogs"  # relative path of TensorBoard logs (same as in your training script)
        ),
        "My_ssh": SshJobService(
            ssh_public_keys="<add-public-key>",
            nodes="all"
        ),
    }
)

# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)
```

The `services` section specifies the training applications you want to interact with.

You can put `sleep <specific time>` at the end of your command to specify the amount of time you want to reserve the compute resource. The format follows:
* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d

You can also use the `sleep infinity` command to keep the job alive indefinitely.

> [!NOTE]
> If you use `sleep infinity`, you will need to manually [cancel the job](./how-to-interactive-jobs.md#end-job) to let go of the compute resource (and stop billing).

2. Submit your training job. For more details on how to train with the Python SDK v2, check out this [article](./how-to-train-model.md).
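
If you keep the job alive with `sleep infinity`, remember to cancel it yourself once you're done debugging. As a rough sketch with the v2 SDK (assuming the `ml_client` and `returned_job` objects from the example above):

```python
# Cancel the job so the compute resource is released and billing stops.
poller = ml_client.jobs.begin_cancel(returned_job.name)
poller.result()  # wait for the cancellation to finish
```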

# [Azure CLI](#tab/azurecli)

1. Create a job YAML file `job.yaml` with the sample content below. Make sure to replace `your compute name` with your own value. If you want to use a custom environment, follow the examples in [this tutorial](how-to-manage-environments-v2.md) to create a custom environment.
```yaml
code: src
command:
  python train.py
  # you can add a command like "sleep 1h" to keep the compute resource reserved after the script finishes running.
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:41
compute: azureml:<your compute name>
services:
  my_vs_code:
    job_service_type: vs_code
    nodes: all # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` isn't set, interactive applications are only enabled on the head node by default. Values are "all" or the compute node index (for example, "0", "1", etc.)
  my_tensor_board:
    job_service_type: tensor_board
    log_dir: "output/tblogs" # relative path of TensorBoard logs (same as in your training script)
    nodes: all
  my_jupyter_lab:
    job_service_type: jupyter_lab
    nodes: all
  my_ssh:
    job_service_type: ssh
    ssh_public_keys: <paste the entire pub key content>
    nodes: all
```

The `services` section specifies the training applications you want to interact with.

You can put `sleep <specific time>` at the end of the command to specify the amount of time you want to reserve the compute resource. The format follows:
* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d

You can also use the `sleep infinity` command to keep the job alive indefinitely.

> [!NOTE]
> If you use `sleep infinity`, you will need to manually [cancel the job](./how-to-interactive-jobs.md#end-job) to let go of the compute resource (and stop billing).

2. Run the command `az ml job create --file <path to your job yaml file> --workspace-name <your workspace name> --resource-group <your resource group name> --subscription <sub-id>` to submit your training job. For more details on running a job via CLI v2, check out this [article](./how-to-train-model.md).

---
### Connect to endpoints
# [Azure Machine Learning studio](#tab/ui)
To interact with your running job, click the **Debug and monitor** button on the job details page.

:::image type="content" source="media/interactive-jobs/debug-and-monitor.png" alt-text="Screenshot of interactive jobs debug and monitor panel location.":::
@@ -206,7 +195,6 @@ You can find the reference documentation for these commands [here](/cli/azure/ml
You can access the applications only when they are in **Running** status, and only the **job owner** is authorized to access them. If you're training on multiple nodes, you can pick the specific node you would like to interact with by passing in the node index.
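
Whichever way you submitted the job, you can only connect while it's in the **Running** state. One way to check that programmatically is a quick v2 Python SDK sketch like the following (assuming an authenticated `MLClient` named `ml_client`; the job name is a placeholder):

```python
# Look up the job and print its current status before trying to connect.
job = ml_client.jobs.get("<job-name>")
print(job.status)  # for example "Running", "Completed", or "Canceled"
```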

---
### Interact with the applications
When you click on the endpoints to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, the interactive capability and application logs can be found in **system_logs->interactive_capability** under the **Outputs + logs** tab.
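
If you prefer to inspect those logs locally, one option is to download the job's outputs and logs with the v2 Python SDK. This is only a sketch (it assumes an authenticated `MLClient` named `ml_client`; the job name and download path are placeholders):

```python
# Download the job's logs and outputs to a local folder for offline inspection.
ml_client.jobs.download(name="<job-name>", download_path="./job-artifacts", all=True)
```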

@@ -258,4 +246,4 @@ To submit a job with a debugger attached and the execution paused, you can use d

## Next steps

+ Learn more about [how and where to deploy a model](./how-to-deploy-online-endpoints.md).
