`articles/machine-learning/how-to-interactive-jobs.md`
## Prerequisites
- Review [getting started with training on Azure Machine Learning](./how-to-train-model.md).
- To use this feature in Azure Machine Learning studio, enable the "Debug & monitor your training jobs" flight via the [preview panel](./how-to-enable-preview-features.md#how-do-i-enable-preview-features).
- To use **VS Code**, [follow this guide](how-to-setup-vs-code.md) to set up the Azure Machine Learning extension.
- Make sure your job environment has the `openssh-server` and `ipykernel ~=6.0` packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
- Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setup (configuring multi-node training without using these distribution frameworks) isn't currently supported.
- To use SSH, you need an SSH key pair. You can use the `ssh-keygen -f "<filepath>"` command to generate a public and private key pair.
## Interact with your job container
By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.
### Enable during job submission
# [Azure Machine Learning studio](#tab/ui)
1. Create a new job from the left navigation pane in the studio portal.
:::image type="content" source="./media/interactive-jobs/sleep-command.png" alt-text="Screenshot of reviewing a drafted job and completing the creation.":::
You can put `sleep <specific time>` at the end of your command to specify the amount of time you want to reserve the compute resource. The format is as follows:
* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d
You can also use the `sleep infinity` command to keep the job alive indefinitely.
If you don't see the above options, make sure you have enabled the "Debug & monitor your training jobs" flight described in the prerequisites.

# [Python SDK](#tab/python)
Note that you have to import the job service classes, such as `SshJobService`, `VsCodeJobService`, `TensorBoardJobService`, and `JupyterLabJobService`, from the `azure.ai.ml.entities` package to configure interactive services via the SDK v2.
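For example, if you enable all four service types shown in the snippet below, the imports might look like this sketch:

```python
from azure.ai.ml import command
from azure.ai.ml.entities import (
    JupyterLabJobService,
    SshJobService,
    TensorBoardJobService,
    VsCodeJobService,
)
```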
```python
command_job = command(
    code="./src",  # local path where the code is stored
    command="python main.py",  # you can add a command like "sleep 1h" to reserve the compute resource after the script finishes running
    environment="<environment-name>@latest",  # replace with your environment
    compute="<compute-name>",  # replace with your compute name
    services={
        "My_jupyterlab": JupyterLabJobService(
            nodes="all"  # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
        ),
        "My_vscode": VsCodeJobService(
            nodes="all"
        ),
        "My_tensorboard": TensorBoardJobService(
            nodes="all",
            log_dir="output/tblogs"  # relative path of Tensorboard logs (same as in your training script)
        ),
        "My_ssh": SshJobService(
            ssh_public_keys="<paste the entire pub key content>",
            nodes="all"
        ),
    },
)
```

The `services` section specifies the training applications you want to interact with.

You can put `sleep <specific time>` at the end of your command to specify the amount of time you want to reserve the compute resource. The format is as follows:

* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d

You can also use the `sleep infinity` command to keep the job alive indefinitely.

> [!NOTE]
> If you use `sleep infinity`, you will need to manually [cancel the job](./how-to-interactive-jobs.md#end-job) to let go of the compute resource (and stop billing).
2. Submit your training job. For more details on how to train with the Python SDK v2, check out this [article](./how-to-train-model.md).
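For illustration, a minimal submission sketch for the `command_job` defined above might look like the following; the subscription ID, resource group, and workspace name are placeholders for your own values:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to your workspace; replace the placeholder identifiers with your own values.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Submit the command job configured with interactive services above.
returned_job = ml_client.jobs.create_or_update(command_job)

# The studio URL opens the job details page, where the interactive
# endpoints appear once the job reaches the Running state.
print(returned_job.studio_url)
```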
# [Azure CLI](#tab/azurecli)
1. Create a job YAML file `job.yaml` with the sample content below. Make sure to replace `your compute name` with your own value. If you want to use a custom environment, follow the examples in [this tutorial](how-to-manage-environments-v2.md) to create a custom environment.
```yaml
code: src
command:
  python train.py
  # you can add a command like "sleep 1h" to reserve the compute resource after the script finishes running.
environment: azureml:<environment-name>@latest # replace with your environment
compute: azureml:<your compute name> # replace with your compute name
services:
  my_vs_code:
    job_service_type: vs_code
    nodes: all # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
  my_tensor_board:
    job_service_type: tensor_board
    log_dir: "output/tblogs" # relative path of Tensorboard logs (same as in your training script)
    nodes: all
  my_jupyter_lab:
    job_service_type: jupyter_lab
    nodes: all
  my_ssh:
    job_service_type: ssh
    ssh_public_keys: <paste the entire pub key content>
    nodes: all
```
The `services` section specifies the training applications you want to interact with.
You can put `sleep <specific time>` at the end of the command to specify the amount of time you want to reserve the compute resource. The format is as follows:

* sleep 1s
* sleep 1m
* sleep 1h
* sleep 1d

You can also use the `sleep infinity` command to keep the job alive indefinitely.

> [!NOTE]
> If you use `sleep infinity`, you will need to manually [cancel the job](./how-to-interactive-jobs.md#end-job) to let go of the compute resource (and stop billing).
2. Run the command `az ml job create --file <path to your job yaml file> --workspace-name <your workspace name> --resource-group <your resource group name> --subscription <sub-id>` to submit your training job. For more details on running a job via CLI v2, check out this [article](./how-to-train-model.md).
---
### Connect to endpoints
# [Azure Machine Learning studio](#tab/ui)
To interact with your running job, click the button **Debug and monitor** on the job details page.
:::image type="content" source="media/interactive-jobs/debug-and-monitor.png" alt-text="Screenshot of interactive jobs debug and monitor panel location.":::
You can find the reference documentation for these commands [here](/cli/azure/ml).
You can access the applications only when they are in **Running** status, and only the **job owner** is authorized to access them. If you're training on multiple nodes, you can pick the specific node you would like to interact with by passing in the node index.
---
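If you prefer to check the job and its interactive services programmatically, a sketch like the following can help; the workspace details and job name are placeholders, and cancelling is only needed when you want to release the compute (for example, after using `sleep infinity`):

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

job_name = "<job-name>"  # placeholder: the name of your submitted job

# Inspect the job status and the interactive services attached to it.
job = ml_client.jobs.get(job_name)
print(job.status)    # applications are reachable only while the job is in Running status
print(job.services)  # the interactive services you configured

# When you're done, cancel the job to release the compute and stop billing.
ml_client.jobs.begin_cancel(job_name).result()
```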
### Interact with the applications
When you click on the endpoints to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs in **system_logs->interactive_capability** under the **Outputs + logs** tab.
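If you'd rather inspect the job's outputs and logs locally, a small sketch such as the following can download them; the workspace details and job name are placeholders:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Download the job's outputs and logs into ./job-logs for inspection.
ml_client.jobs.download(name="<job-name>", download_path="./job-logs", all=True)
```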
To submit a job with a debugger attached and the execution paused, you can use debugpy and VS Code.
## Next steps
+ Learn more about [how and where to deploy a model](./how-to-deploy-online-endpoints.md).