articles/machine-learning/how-to-manage-kubernetes-instance-types.md
36 additions & 37 deletions
@@ -1,31 +1,31 @@
1
1
---
2
2
title: Create and manage instance types for efficient utilization of compute resources
3
-
description: Learn about what instance types are, how to create and manage them, and what the benefits of using them are.
3
+
description: Learn what instance types are, how to create and manage them, and the benefits of using them.
4
4
titleSuffix: Azure Machine Learning
5
5
author: s-polly
6
6
ms.author: scottpolly
7
-
ms.reviewer: bozhlin
7
+
ms.reviewer: namanjoshi
8
8
ms.service: azure-machine-learning
9
9
ms.subservice: core
10
-
ms.date: 01/09/2024
10
+
ms.date: 07/23/2025
11
11
ms.topic: how-to
12
12
ms.custom: build-spring-2022, cliv2, sdkv2
13
13
---
14
14
15
15
# Create and manage instance types for efficient utilization of compute resources
16
16
17
-
Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For example, in an Azure virtual machine, an instance type is `STANDARD_D2_V3`. This article teaches you how to create and manage instance types for your computation requirements.
17
+
Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For example, in an Azure virtual machine, an instance type is `STANDARD_D2_V3`. This article shows you how to create and manage instance types for your computation requirements.
18
18
19
-
In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that's installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent the instance types:
19
+
In Kubernetes clusters, instance types are represented as a custom resource definition (CRD) installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent instance types:
20
20
21
-
- Use [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) to specify which node a pod should run on. The node must have a corresponding label.
22
-
- In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.
21
+
- **nodeSelector**: Use [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) to specify which node a pod should run on. The node must have a corresponding label.
22
+
- **resources**: In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.
23
23
24
-
If you [specify a nodeSelector field when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the `nodeSelector` field will be applied to all instance types. This means that:
24
+
If you [specify a nodeSelector field when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the `nodeSelector` field applies to all instance types. This means:
25
25
26
26
- For each instance type that you create, the specified `nodeSelector` field should be a subset of the extension-specified `nodeSelector` field.
27
-
- If you use an instance type with `nodeSelector`, the workload will run on any node that matches both the extension-specified `nodeSelector` field and the instance-type-specified `nodeSelector` field.
28
-
- If you use an instance type without a `nodeSelector` field, the workload will run on any node that matches the extension-specified `nodeSelector` field.
27
+
- If you use an instance type with `nodeSelector`, the workload runs on any node that matches both the extension-specified `nodeSelector` field and the instance-type-specified `nodeSelector` field.
28
+
- If you use an instance type without a `nodeSelector` field, the workload runs on any node that matches the extension-specified `nodeSelector` field.
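
As a rough illustration of the matching rules above, the following Python sketch models how a node becomes eligible for a workload. The function names (`selector_matches`, `node_is_eligible`) and the sample labels are hypothetical and aren't part of the Azure Machine Learning extension. The sketch assumes only standard Kubernetes `nodeSelector` semantics, where every key/value pair in a selector must appear in the node's labels.

```python
from typing import Dict, Optional

Labels = Dict[str, str]


def selector_matches(node_labels: Labels, selector: Labels) -> bool:
    """Return True when the node carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def node_is_eligible(
    node_labels: Labels,
    extension_selector: Labels,
    instance_type_selector: Optional[Labels] = None,
) -> bool:
    """Model the combined check described in the list above.

    The node must match the extension-specified selector and, when the
    instance type defines its own selector, that selector as well.
    """
    if not selector_matches(node_labels, extension_selector):
        return False
    # An instance type without a nodeSelector adds no further constraint.
    return selector_matches(node_labels, instance_type_selector or {})


# Hypothetical labels, for illustration only.
gpu_node = {"pool": "ml", "mylabel": "mylabelvalue"}
print(node_is_eligible(gpu_node, {"pool": "ml"}, {"mylabel": "mylabelvalue"}))
```

In this model, an instance type selector can only narrow the set of nodes that the extension-level selector already allows.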
29
29
30
30
## Create a default instance type
31
31
@@ -44,11 +44,11 @@ resources:
44
44
45
45
If you don't apply a `nodeSelector` field, the pod can be scheduled on any node. The workload's pods are assigned default resources with 0.1 CPU cores, 2 GB of memory, and 0 GPUs for the request. The resources that the workload's pods use are limited to 2 CPU cores and 8 GB of memory.
46
46
47
-
The default instance type purposefully uses few resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resource), we highly recommend that you [create custom instance types](#create-a-custom-instance-type).
47
+
The default instance type purposefully uses minimal resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resources), we highly recommend that you [create custom instance types](#create-a-custom-instance-type).
48
48
49
49
Keep in mind the following points about the default instance type:
50
50
51
-
- `defaultinstancetype`doesn't appear as an `InstanceType` custom resource in the cluster when you're running the command ```kubectl get instancetype```, but it does appear in all clients (UI, Azure CLI, SDK).
51
+
- `defaultinstancetype` doesn't appear as an `InstanceType` custom resource in the cluster when you run the command `kubectl get instancetype`, but it does appear in all clients (UI, Azure CLI, SDK).
52
52
- `defaultinstancetype` can be overridden with the definition of a custom instance type that has the same name.
53
53
54
54
## Create a custom instance type
@@ -79,25 +79,25 @@ spec:
79
79
memory: "1500Mi"
80
80
```
81
81
82
-
The preceding code creates an instance type with the labeled behavior:
82
+
The preceding code creates an instance type with the following behavior:
83
83
84
84
- Pods are scheduled only on nodes that have the label `mylabel: mylabelvalue`.
85
85
- Pods are assigned resource requests of `700m` for CPU and `1500Mi` for memory.
86
86
- Pods are assigned resource limits of `1` for CPU, `2Gi` for memory, and `1` for NVIDIA GPU.
87
87
88
-
Creation of custom instance types must meet the following parameters and definition rules, or it fails:
88
+
Custom instance type creation must meet the following parameters and definition rules, or it fails:
89
89
90
90
| Parameter | Required or optional | Description |
91
91
| --- | --- | --- |
92
-
| `name` | Required | String values, which must be unique in a cluster.|
93
-
| `CPU request` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
94
-
| `Memory request` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 mebibytes (MiB).|
95
-
| `CPU limit` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
96
-
| `Memory limit` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
97
-
| `GPU` | Optional | Integer values, which can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
92
+
| `name` | Required | String values that must be unique in a cluster.|
93
+
| `CPU request` | Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
94
+
| `Memory request` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 mebibytes (MiB).|
95
+
| `CPU limit` | Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
96
+
| `Memory limit` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
97
+
| `GPU` | Optional | Integer values that can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
98
98
| `nodeSelector` | Optional | Map of string keys and values. |
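
To make the rules in the table more concrete, here is a minimal Python sketch that checks a candidate definition before you apply it. It's an approximation for local sanity checks, not the validation that the extension itself performs. The helper names (`parse_cpu`, `parse_memory`, `validate_instance_type`) are hypothetical, and the memory parser handles only a common subset of Kubernetes quantity suffixes.

```python
from typing import Dict, List

# Common Kubernetes memory suffixes (a subset, assumed sufficient for this sketch).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(value: str) -> float:
    """Convert a CPU quantity such as '100m' or '1' to a number of cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Convert a memory quantity such as '1024Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # A bare number is interpreted as bytes.


def validate_instance_type(requests: Dict[str, str], limits: Dict[str, str]) -> List[str]:
    """Return a list of rule violations; an empty list means the checks passed.

    Malformed quantities raise ValueError instead of being reported.
    """
    errors = []
    for section_name, section in (("requests", requests), ("limits", limits)):
        for key, parser in (("cpu", parse_cpu), ("memory", parse_memory)):
            raw = section.get(key, "")
            if not raw or parser(raw) <= 0:
                errors.append(f"{section_name}.{key} can't be zero or empty")
    # The GPU count is optional and belongs only in the limits section.
    if "nvidia.com/gpu" in requests:
        errors.append("nvidia.com/gpu can be specified only in limits")
    return errors


# Values taken from the custom instance type example in this article.
print(validate_instance_type(
    {"cpu": "700m", "memory": "1500Mi"},
    {"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": "1"},
))
```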
99
99
100
-
It's also possible to create multiple instance types at once:
100
+
You can also create multiple instance types at once:
101
101
102
102
```bash
103
103
kubectl apply -f my_instance_type_list.yaml
@@ -142,8 +142,7 @@ If you submit a training or inference workload without an instance type, it uses
To select an instance type for a training job by using the Azure CLI (v2), specify its name as part of the
146
-
`resources` properties section in the job YAML. For example:
145
+
To select an instance type for a training job using the Azure CLI (v2), specify its name as part of the `resources` properties section in the job YAML. For example:
To select an instance type for a training job by using the SDK (v2), specify its name for the `instance_type` property in the `command` class. For example:
158
+
To select an instance type for a training job using the SDK (v2), specify its name for the `instance_type` property in the `command` class. For example:
To select an instance type for a model deployment by using the Azure CLI (v2), specify its name for the `instance_type` property in the deployment YAML. For example:
180
+
To select an instance type for a model deployment using the Azure CLI (v2), specify its name for the `instance_type` property in the deployment YAML. For example:
To select an instance type for a model deployment by using the SDK (v2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:
199
+
To select an instance type for a model deployment using the SDK (v2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:
201
200
202
201
```python
203
202
from azure.ai.ml.entities import KubernetesOnlineDeployment, Model, Environment, CodeConfiguration
In the preceding example, replace `<instance type name>` with the name of the instance type that you want to select. If you don't specify an `instance_type` property, the system uses `defaultinstancetype` to deploy the model.
228
227
229
228
> [!IMPORTANT]
230
-
> For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment will fail.
229
+
> For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment fails.
231
230
232
231
### Resource section validation
233
232
234
-
You can use the `resources` section to define the resource request and limit of your model deployments. For example:
233
+
Use the `resources` section to define the resource request and limit for your model deployments. For example:
If you use the `resources` section, a valid resource definition needs to meet the following rules. An invalid resource definition causes the model deployment to fail.
299
+
When you use the `resources` section, a valid resource definition must meet the following rules. An invalid resource definition causes the model deployment to fail.
301
300
302
301
| Parameter | Required or optional | Description |
303
302
| --- | --- | --- |
304
-
| `requests:`<br>`cpu:`| Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`.|
305
-
| `requests:`<br>`memory:` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB. <br>Memory can't be less than 1 MB.|
306
-
| `limits:`<br>`cpu:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`. |
307
-
| `limits:`<br>`memory:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
308
-
| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(required only when you need GPU) | Integer values, which can't be empty and can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If you require CPU only, you can omit the entire `limits` section.|
303
+
| `requests:`<br>`cpu:`| Required | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`.|
304
+
| `requests:`<br>`memory:` | Required | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB. <br>Memory can't be less than 1 MB.|
305
+
| `limits:`<br>`cpu:` | Optional <br>(required only when you need GPU) | String values that can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`. |
306
+
| `limits:`<br>`memory:` | Optional <br>(required only when you need GPU) | String values that can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
307
+
| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(required only when you need GPU) | Integer values that can't be empty and can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If you require CPU only, you can omit the entire `limits` section.|
309
308
310
-
The instance type is *required* for model deployment. If you defined the `resources` section, and it will be validated against the instance type, the rules are as follows:
309
+
An instance type is *required* for model deployment. If you define the `resources` section, it's validated against the instance type according to the following rules:
311
310
312
-
- With a valid `resource` section definition, the resource limits must be less than the instance type limits. Otherwise, deployment will fail.
311
+
- With a valid `resources` section definition, the resource limits must be less than the instance type limits. Otherwise, deployment fails.
313
312
- If you don't define an instance type, the system uses `defaultinstancetype` for validation with the `resources` section.
314
313
- If you don't define the `resources` section, the system uses the instance type to create the deployment.
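
The three rules above can be sketched as a small decision function. This is only a model of the described behavior, not the service's implementation. The name `effective_limits` is hypothetical, and values are passed as already-parsed numbers (CPU cores and bytes). The sketch also makes two assumptions: the 8 GB limit of `defaultinstancetype` is read as 8 GiB, and a resource limit equal to the instance type limit is accepted.

```python
from typing import Dict, Optional

Limits = Dict[str, float]

# Limits of defaultinstancetype as described in this article
# (assumption: "8 GB" is interpreted as 8 GiB).
DEFAULT_INSTANCE_TYPE_LIMITS: Limits = {"cpu": 2.0, "memory": 8 * 2**30}


def effective_limits(
    resource_limits: Optional[Limits],
    instance_type_limits: Optional[Limits] = None,
) -> Limits:
    """Return the limits a deployment would run with, or raise ValueError."""
    # No instance type defined: validate against defaultinstancetype.
    ceiling = instance_type_limits or DEFAULT_INSTANCE_TYPE_LIMITS
    # No resources section: the instance type alone defines the deployment.
    if resource_limits is None:
        return dict(ceiling)
    for key, value in resource_limits.items():
        # Assumption: equality with the instance type limit is allowed.
        if value > ceiling.get(key, 0):
            raise ValueError(f"{key} limit {value} exceeds instance type limit")
    return dict(resource_limits)


print(effective_limits({"cpu": 1.0, "memory": 2 * 2**30}))
```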