Commit d99509d

Merge pull request #6175 from s-polly/stp-k8s
Freshness check, k8s instance types
2 parents a8d34f6 + 4cd207a commit d99509d

1 file changed

+36
-37
lines changed

articles/machine-learning/how-to-manage-kubernetes-instance-types.md

Lines changed: 36 additions & 37 deletions
@@ -1,31 +1,31 @@
 ---
 title: Create and manage instance types for efficient utilization of compute resources
-description: Learn about what instance types are, how to create and manage them, and what the benefits of using them are.
+description: Learn what instance types are, how to create and manage them, and the benefits of using them.
 titleSuffix: Azure Machine Learning
 author: s-polly
 ms.author: scottpolly
-ms.reviewer: bozhlin
+ms.reviewer: namanjoshi
 ms.service: azure-machine-learning
 ms.subservice: core
-ms.date: 01/09/2024
+ms.date: 07/23/2025
 ms.topic: how-to
 ms.custom: build-spring-2022, cliv2, sdkv2
 ---

 # Create and manage instance types for efficient utilization of compute resources

-Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For example, in an Azure virtual machine, an instance type is `STANDARD_D2_V3`. This article teaches you how to create and manage instance types for your computation requirements.
+Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For example, in an Azure virtual machine, an instance type is `STANDARD_D2_V3`. This article shows you how to create and manage instance types for your computation requirements.

-In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that's installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent the instance types:
+In Kubernetes clusters, instance types are represented as a custom resource definition (CRD) installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent instance types:

-- Use [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) to specify which node a pod should run on. The node must have a corresponding label.
-- In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.
+- **nodeSelector**: Use [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) to specify which node a pod should run on. The node must have a corresponding label.
+- **resources**: In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.

-If you [specify a nodeSelector field when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the `nodeSelector` field will be applied to all instance types. This means that:
+If you [specify a nodeSelector field when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the `nodeSelector` field applies to all instance types. This means:

 - For each instance type that you create, the specified `nodeSelector` field should be a subset of the extension-specified `nodeSelector` field.
-- If you use an instance type with `nodeSelector`, the workload will run on any node that matches both the extension-specified `nodeSelector` field and the instance-type-specified `nodeSelector` field.
-- If you use an instance type without a `nodeSelector` field, the workload will run on any node that matches the extension-specified `nodeSelector` field.
+- If you use an instance type with `nodeSelector`, the workload runs on any node that matches both the extension-specified `nodeSelector` field and the instance-type-specified `nodeSelector` field.
+- If you use an instance type without a `nodeSelector` field, the workload runs on any node that matches the extension-specified `nodeSelector` field.

 ## Create a default instance type

@@ -44,11 +44,11 @@ resources:

 If you don't apply a `nodeSelector` field, the pod can be scheduled on any node. The workload's pods are assigned default resources with 0.1 CPU cores, 2 GB of memory, and 0 GPUs for the request. The resources that the workload's pods use are limited to 2 CPU cores and 8 GB of memory.

-The default instance type purposefully uses few resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resource), we highly recommend that you [create custom instance types](#create-a-custom-instance-type).
+The default instance type purposefully uses minimal resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resources), we highly recommend that you [create custom instance types](#create-a-custom-instance-type).

 Keep in mind the following points about the default instance type:

-- `defaultinstancetype` doesn't appear as an `InstanceType` custom resource in the cluster when you're running the command ```kubectl get instancetype```, but it does appear in all clients (UI, Azure CLI, SDK).
+- `defaultinstancetype` doesn't appear as an `InstanceType` custom resource in the cluster when you run the command `kubectl get instancetype`, but it does appear in all clients (UI, Azure CLI, SDK).
 - `defaultinstancetype` can be overridden with the definition of a custom instance type that has the same name.

 ## Create a custom instance type
@@ -79,25 +79,25 @@ spec:
 memory: "1500Mi"
 ```

-The preceding code creates an instance type with the labeled behavior:
+The preceding code creates an instance type with the following behavior:

 - Pods are scheduled only on nodes that have the label `mylabel: mylabelvalue`.
 - Pods are assigned resource requests of `700m` for CPU and `1500Mi` for memory.
 - Pods are assigned resource limits of `1` for CPU, `2Gi` for memory, and `1` for NVIDIA GPU.

-Creation of custom instance types must meet the following parameters and definition rules, or it fails:
+Custom instance type creation must meet the following parameters and definition rules, or it fails:

 | Parameter | Required or optional | Description |
 | --- | --- | --- |
-| `name` | Required | String values, which must be unique in a cluster.|
-| `CPU request` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
-| `Memory request` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 mebibytes (MiB).|
-| `CPU limit` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
-| `Memory limit` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
-| `GPU` | Optional | Integer values, which can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
+| `name` | Required | String values that must be unique in a cluster.|
+| `CPU request` | Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `Memory request` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 mebibytes (MiB).|
+| `CPU limit` | Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `Memory limit` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
+| `GPU` | Optional | Integer values that can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
 | `nodeSelector` | Optional | Map of string keys and values. |

-It's also possible to create multiple instance types at once:
+You can also create multiple instance types at once:

 ```bash
 kubectl apply -f my_instance_type_list.yaml
@@ -142,8 +142,7 @@ If you submit a training or inference workload without an instance type, it uses

 ### [Azure CLI](#tab/select-instancetype-to-trainingjob-with-cli)

-To select an instance type for a training job by using the Azure CLI (v2), specify its name as part of the
-`resources` properties section in the job YAML. For example:
+To select an instance type for a training job using the Azure CLI (v2), specify its name as part of the `resources` properties section in the job YAML. For example:

 ```yaml
 command: python -c "print('Hello world!')"
@@ -156,14 +155,14 @@ resources:

 ### [Python SDK](#tab/select-instancetype-to-trainingjob-with-sdk)

-To select an instance type for a training job by using the SDK (v2), specify its name for the `instance_type` property in the `command` class. For example:
+To select an instance type for a training job using the SDK (v2), specify its name for the `instance_type` property in the `command` class. For example:

 ```python
 from azure.ai.ml import command

 # define the command
 command_job = command(
-command="python -c "print('Hello world!')"",
+command="python -c print('Hello world!')"",
 environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
 compute="<Kubernetes-compute_target_name>",
 instance_type="<instance type name>"
@@ -178,7 +177,7 @@ In the preceding example, replace `<Kubernetes-compute_target_name>` with the na

 ### [Azure CLI](#tab/select-instancetype-to-modeldeployment-with-cli)

-To select an instance type for a model deployment by using the Azure CLI (v2), specify its name for the `instance_type` property in the deployment YAML. For example:
+To select an instance type for a model deployment using the Azure CLI (v2), specify its name for the `instance_type` property in the deployment YAML. For example:

 ```yaml
 name: blue
@@ -197,7 +196,7 @@ environment:

 ### [Python SDK](#tab/select-instancetype-to-modeldeployment-with-sdk)

-To select an instance type for a model deployment by using the SDK (v2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:
+To select an instance type for a model deployment using the SDK (v2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:

 ```python
 from azure.ai.ml import KubernetesOnlineDeployment,Model,Environment,CodeConfiguration
@@ -227,11 +226,11 @@ blue_deployment = KubernetesOnlineDeployment(
 In the preceding example, replace `<instance type name>` with the name of the instance type that you want to select. If you don't specify an `instance_type` property, the system uses `defaultinstancetype` to deploy the model.

 > [!IMPORTANT]
-> For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment will fail.
+> For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment fails.

 ### Resource section validation

-You can use the `resources` section to define the resource request and limit of your model deployments. For example:
+Use the `resources` section to define the resource request and limit for your model deployments. For example:

 #### [Azure CLI](#tab/define-resource-to-modeldeployment-with-cli)

@@ -297,19 +296,19 @@ blue_deployment = KubernetesOnlineDeployment(

 ---

-If you use the `resources` section, a valid resource definition needs to meet the following rules. An invalid resource definition causes the model deployment to fail.
+When you use the `resources` section, a valid resource definition must meet the following rules. An invalid resource definition causes the model deployment to fail.

 | Parameter | Required or optional | Description |
 | --- | --- | --- |
-| `requests:`<br>`cpu:`| Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`.|
-| `requests:`<br>`memory:` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB. <br>Memory can't be less than 1 MB.|
-| `limits:`<br>`cpu:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`. |
-| `limits:`<br>`memory:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
-| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(required only when you need GPU) | Integer values, which can't be empty and can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If you require CPU only, you can omit the entire `limits` section.|
+| `requests:`<br>`cpu:`| Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `requests:`<br>`memory:` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB. <br>Memory can't be less than 1 MB.|
+| `limits:`<br>`cpu:` | Optional <br>(required only when you need GPU) | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`. |
+| `limits:`<br>`memory:` | Optional <br>(required only when you need GPU) | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
+| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(required only when you need GPU) | Integer values that can't be empty and can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If you require CPU only, you can omit the entire `limits` section.|

-The instance type is *required* for model deployment. If you defined the `resources` section, and it will be validated against the instance type, the rules are as follows:
+An instance type is *required* for model deployment. If you define the `resources` section, it's validated against the instance type according to the following rules:

-- With a valid `resource` section definition, the resource limits must be less than the instance type limits. Otherwise, deployment will fail.
+- With a valid `resource` section definition, the resource limits must be less than the instance type limits. Otherwise, deployment fails.
 - If you don't define an instance type, the system uses `defaultinstancetype` for validation with the `resources` section.
 - If you don't define the `resources` section, the system uses the instance type to create the deployment.
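
Note on the final hunk: the rule that resource limits must fit within the instance type limits, together with the table's statement that `"1"` is equivalent to `1000m`, can be illustrated with a small client-side check. This is only a sketch and not the validation that Azure Machine Learning actually runs. The helper names `parse_cpu`, `parse_memory`, and `limits_fit` are hypothetical, and whether a limit exactly equal to the instance type limit is accepted is an assumption exposed through the `allow_equal` flag (the article text says "less than").

```python
from decimal import Decimal

# Suffix multipliers for Kubernetes-style memory quantities (assumed subset).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Return CPU cores: "700m" is 0.7 core, "1" is 1 core."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> Decimal:
    """Return bytes for quantities such as "1500Mi" or "2Gi"."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)  # plain number of bytes


def limits_fit(requested: dict, instance_type: dict, allow_equal: bool = True) -> bool:
    """Check that each requested limit fits within the instance type limit.

    Hypothetical helper: allow_equal=True is an assumption, not documented behavior.
    """
    parsers = {"cpu": parse_cpu, "memory": parse_memory, "nvidia.com/gpu": Decimal}
    for key, parse in parsers.items():
        if key not in requested:
            continue  # nothing requested for this resource
        if key not in instance_type:
            return False  # requested a resource the instance type doesn't offer
        want = parse(str(requested[key]))
        cap = parse(str(instance_type[key]))
        if want > cap or (want == cap and not allow_equal):
            return False
    return True
```

For example, `limits_fit({"cpu": "1", "memory": "2Gi"}, {"cpu": "2", "memory": "8Gi"})` passes, while a request of `"cpu": "4"` against the same instance type does not.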
