Commit e8112ff

edit pass: how-to-manage-kubernetes-instance-types
1 parent cd98798 commit e8112ff

File tree

1 file changed: +52 -53 lines

articles/machine-learning/how-to-manage-kubernetes-instance-types.md

Lines changed: 52 additions & 53 deletions
@@ -14,22 +14,22 @@ ms.custom: build-spring-2022, cliv2, sdkv2, event-tier1-build-2022

# Create and manage instance types for efficient utilization of compute resources

-Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For an Azure VM, an example for an instance type is `STANDARD_D2_V3`.
+Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For an Azure virtual machine (VM), an example of an instance type is `STANDARD_D2_V3`.

-In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that is installed with the Azure Machine Learning extension. Two elements in Azure Machine Learning extension represent the instance types:
+In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that's installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent the instance types:

- Use [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) to specify which node a pod should run on. The node must have a corresponding label.
-- In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory and NVIDIA GPU) for the pod.
+- In the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.

-If you [specify a nodeSelector when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the nodeSelector will be applied to all instance types. This means that:
+If you [specify a nodeSelector field when deploying the Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings), the `nodeSelector` field will be applied to all instance types. This means that:

-- For each instance type creating, the specified nodeSelector should be a subset of the extension-specified nodeSelector.
-- If you use an instance type **with nodeSelector**, the workload will run on any node matching both the extension-specified nodeSelector and the instance type-specified nodeSelector.
-- If you use an instance type **without a nodeSelector**, the workload will run on any node matching the extension-specified nodeSelector.
+- For each instance type that you create, the specified `nodeSelector` field should be a subset of the extension-specified `nodeSelector` field.
+- If you use an instance type with `nodeSelector`, the workload will run on any node that matches both the extension-specified `nodeSelector` field and the instance type-specified `nodeSelector` field.
+- If you use an instance type without a `nodeSelector` field, the workload will run on any node that matches the extension-specified `nodeSelector` field.
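
To make the interaction concrete, here's a minimal sketch of an instance type whose `nodeSelector` is a subset of a hypothetical extension-specified `nodeSelector`. The labels, name, and resource values are illustrative and not from the article:

```yaml
# Assumption: the extension was deployed with the hypothetical nodeSelector
#   purpose: azureml
#   team: training
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: trainingnode            # illustrative name
spec:
  nodeSelector:
    purpose: azureml            # a subset of the extension-specified nodeSelector
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
# Workloads that use this instance type are scheduled only on nodes that match
# both selectors, that is, nodes labeled purpose: azureml and team: training.
```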

## Create a default instance type

-By default, a `defaultinstancetype` with the following definition is created when you attach a Kubernetes cluster to an Azure Machine Learning workspace.
+By default, an instance type called `defaultinstancetype` is created when you attach a Kubernetes cluster to an Azure Machine Learning workspace. Here's the definition:

```yaml
resources:
@@ -42,12 +42,14 @@ resources:
nvidia.com/gpu: null
```

-If you don't apply a `nodeSelector`, it means the pod can get scheduled on any node. The workload's pods are assigned default resources with 0.1 cpu cores, 2-GB memory and 0 GPU for request. The resources used by the workload's pods are limited to 2 cpu cores and 8-GB memory:
+If you don't apply a `nodeSelector` field, the pod can be scheduled on any node. The workload's pods are assigned default resources with 0.1 CPU cores, 2 GB of memory, and 0 GPUs for the request. The resources that the workload's pods use are limited to 2 CPU cores and 8 GB of memory.
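
The definition itself is collapsed in this diff. Based on the request and limit values described in the preceding paragraph, the `defaultinstancetype` resource section is roughly the following sketch; treat the exact layout as illustrative:

```yaml
resources:
  requests:
    cpu: "100m"          # 0.1 CPU cores
    memory: "2Gi"        # 2 GB of memory
    nvidia.com/gpu: null # 0 GPUs requested
  limits:
    cpu: "2"             # limited to 2 CPU cores
    memory: "8Gi"        # limited to 8 GB of memory
    nvidia.com/gpu: null
```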

-The default instance type purposefully uses little resources. To ensure all ML workloads run with appropriate resources, for example GPU resource, it is highly recommended to create custom instance types.
+The default instance type purposefully uses few resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resource), we highly recommend that you create custom instance types.

-- `defaultinstancetype` will not appear as an InstanceType custom resource in the cluster when running the command ```kubectl get instancetype```, but it will appear in all clients (UI, CLI, SDK).
-- `defaultinstancetype` can be overridden with a custom instance type definition having the same name as `defaultinstancetype` (see [Create custom instance types](#create-a-custom-instance-type) section)
+Keep in mind the following points about the default instance type:
+
+- `defaultinstancetype` doesn't appear as an `InstanceType` custom resource in the cluster when you're running the command ```kubectl get instancetype```, but it does appear in all clients (UI, Azure CLI, SDK).
+- `defaultinstancetype` can be overridden with a [custom instance type](#create-a-custom-instance-type) definition that has the same name.

## Create a custom instance type

@@ -57,7 +59,7 @@ To create a new instance type, create a new custom resource for the instance typ
kubectl apply -f my_instance_type.yaml
```

-With `my_instance_type.yaml`:
+Here are the contents of *my_instance_type.yaml*:

```yaml
apiVersion: amlarc.azureml.com/v1alpha1
@@ -77,31 +79,31 @@ spec:
memory: "1500Mi"
```

-The following steps create an instance type with the labeled behavior:
+The preceding code creates an instance type with the labeled behavior:

-- Pods are scheduled only on nodes with label `mylabel: mylabelvalue`.
-- Pods are assigned resource requests of `700m` CPU and `1500Mi` memory.
-- Pods are assigned resource limits of `1` CPU, `2Gi` memory and `1` NVIDIA GPU.
+- Pods are scheduled only on nodes that have the label `mylabel: mylabelvalue`.
+- Pods are assigned resource requests of `700m` for CPU and `1500Mi` for memory.
+- Pods are assigned resource limits of `1` for CPU, `2Gi` for memory, and `1` for NVIDIA GPU.
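
Most of *my_instance_type.yaml* is collapsed in this diff. Here's a sketch of the complete file, reconstructed from the bulleted behavior above; the instance type name is illustrative:

```yaml
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename      # illustrative name
spec:
  nodeSelector:
    mylabel: mylabelvalue       # pods are scheduled only on nodes with this label
  resources:
    requests:
      cpu: "700m"
      memory: "1500Mi"
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
```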

-Creation of custom instance types must meet the following parameters and definition rules, otherwise the instance type creation fails:
+Creation of custom instance types must meet the following parameters and definition rules, or it will fail:

-| Parameter | Required | Description |
+| Parameter | Required or optional | Description |
| --- | --- | --- |
-| name | required | String values, which must be unique in cluster.|
-| CPU request | required | String values, which cannot be 0 or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers; for example, `"1"` is equivalent to `1000m`.|
-| Memory request | required | String values, which cannot be 0 or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
-| CPU limit | required | String values, which cannot be 0 or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers; for example, `"1"` is equivalent to `1000m`.|
-| Memory limit | required | String values, which cannot be 0 or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
-| GPU | optional | Integer values, which can only be specified in the `limits` section. <br>For more information, see the Kubernetes [documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
-| nodeSelector | optional | Map of string keys and values. |
+| `name` | Required | String values, which must be unique in a cluster.|
+| `CPU request` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `Memory request` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 mebibytes (MiB).|
+| `CPU limit` | Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it as full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `Memory limit` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB.|
+| `GPU` | Optional | Integer values, which can only be specified in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). |
+| `nodeSelector` | Optional | Map of string keys and values. |

It's also possible to create multiple instance types at once:

```bash
kubectl apply -f my_instance_type_list.yaml
```

-With `my_instance_type_list.yaml`:
+Here are the contents of *my_instance_type_list.yaml*:

```yaml
apiVersion: amlarc.azureml.com/v1alpha1
@@ -132,16 +134,16 @@ items:
memory: "1Gi"
```

-The above example creates two instance types: `cpusmall` and `defaultinstancetype`. This `defaultinstancetype` definition overrides the `defaultinstancetype` definition created when Kubernetes cluster was attached to Azure Machine Learning workspace.
+The preceding example creates two instance types: `cpusmall` and `defaultinstancetype`. This `defaultinstancetype` definition overrides the `defaultinstancetype` definition that was created when you attached the Kubernetes cluster to the Azure Machine Learning workspace.

-If you submit a training or inference workload without an instance type, it uses the `defaultinstancetype`. To specify a default instance type for a Kubernetes cluster, create an instance type with name `defaultinstancetype`. It's automatically recognized as the default.
+If you submit a training or inference workload without an instance type, it uses `defaultinstancetype`. To specify a default instance type for a Kubernetes cluster, create an instance type with the name `defaultinstancetype`. It's automatically recognized as the default.
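
The list file is also mostly collapsed in this diff. Here's a sketch of what *my_instance_type_list.yaml* plausibly contains, assuming a `kind: InstanceTypeList` wrapper with an `items` array; the resource values are illustrative:

```yaml
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList          # assumed kind for a list of instance types
items:
  - metadata:
      name: cpusmall
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
  - metadata:
      name: defaultinstancetype # overrides the built-in default instance type
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```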

## Select an instance type to submit a training job

### [Azure CLI](#tab/select-instancetype-to-trainingjob-with-cli)

-To select an instance type for a training job using CLI (V2), specify its name as part of the
-`resources` properties section in job YAML. For example:
+To select an instance type for a training job by using the Azure CLI (V2), specify its name as part of the
+`resources` properties section in the job YAML. For example:

```yaml
command: python -c "print('Hello world!')"
@@ -154,7 +156,7 @@ resources:

### [Python SDK](#tab/select-instancetype-to-trainingjob-with-sdk)

-To select an instance type for a training job using SDK (V2), specify its name for `instance_type` property in `command` class. For example:
+To select an instance type for a training job by using the SDK (V2), specify its name for the `instance_type` property in the `command` class. For example:

```python
from azure.ai.ml import command
@@ -170,13 +172,13 @@ command_job = command(

---

-In the above example, replace `<Kubernetes-compute_target_name>` with the name of your Kubernetes compute target and replace `<instance_type_name>` with the name of the instance type you wish to select. If there's no `instance_type` property specified, the system uses `defaultinstancetype` to submit the job.
+In the preceding example, replace `<Kubernetes-compute_target_name>` with the name of your Kubernetes compute target. Replace `<instance_type_name>` with the name of the instance type that you want to select. If you don't specify an `instance_type` property, the system uses `defaultinstancetype` to submit the job.
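
The job YAML is mostly collapsed in this diff. Here's a minimal sketch of a CLI (V2) command job that targets a Kubernetes compute and an instance type; the environment image is an illustrative assumption:

```yaml
command: python -c "print('Hello world!')"
environment:
  image: library/python:latest          # illustrative environment image
compute: azureml:<Kubernetes-compute_target_name>
resources:
  instance_type: <instance_type_name>   # omit to fall back to defaultinstancetype
```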

## Select an instance type to deploy a model

### [Azure CLI](#tab/select-instancetype-to-modeldeployment-with-cli)

-To select an instance type for a model deployment using CLI (V2), specify its name for the `instance_type` property in the deployment YAML. For example:
+To select an instance type for a model deployment by using the Azure CLI (V2), specify its name for the `instance_type` property in the deployment YAML. For example:

```yaml
name: blue
@@ -195,7 +197,7 @@ environment:

### [Python SDK](#tab/select-instancetype-to-modeldeployment-with-sdk)

-To select an instance type for a model deployment using SDK (V2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:
+To select an instance type for a model deployment by using the SDK (V2), specify its name for the `instance_type` property in the `KubernetesOnlineDeployment` class. For example:

```python
from azure.ai.ml import KubernetesOnlineDeployment,Model,Environment,CodeConfiguration
@@ -222,14 +224,14 @@ blue_deployment = KubernetesOnlineDeployment(

---

-In the above example, replace `<instance_type_name>` with the name of the instance type you wish to select. If there's no `instance_type` property specified, the system uses `defaultinstancetype` to deploy the model.
+In the preceding example, replace `<instance_type_name>` with the name of the instance type that you want to select. If you don't specify an `instance_type` property, the system uses `defaultinstancetype` to deploy the model.
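
The deployment YAML is mostly collapsed in this diff. Here's a minimal sketch showing where `instance_type` sits in a Kubernetes online deployment specification; the model, code, and environment entries are illustrative placeholders:

```yaml
name: blue
endpoint_name: <endpoint_name>
model:
  path: ./model/sklearn_mnist_model.pkl # illustrative model file
code_configuration:
  code: ./script/
  scoring_script: score.py              # illustrative scoring script
environment: azureml:<environment_name>:<version>
instance_type: <instance_type_name>     # omit to fall back to defaultinstancetype
```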

> [!IMPORTANT]
-> For MLFlow model deployment, the resource request require at least 2 CPU and 4 GB memory. Otherwise, the deployment will fail.
+> For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment will fail.
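
As a concrete illustration of that minimum, a `resources` request used for an MLflow model deployment would need at least the following values; this is a sketch based on the note above:

```yaml
resources:
  requests:
    cpu: "2"        # at least 2 CPU cores
    memory: "4Gi"   # at least 4 GB of memory
```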

### Resource section validation

-If you're using the `resource section` to define the resource request and limit of your model deployments, for example:
+You can use the `resources` section to define the resource request and limit of your model deployments. For example:

#### [Azure CLI](#tab/define-resource-to-modeldeployment-with-cli)

@@ -295,26 +297,23 @@ blue_deployment = KubernetesOnlineDeployment(

---

-If you use the `resource section`, the valid resource definition need to meet the following rules, otherwise the model deployment fails due to the invalid resource definition:
+If you use the `resources` section, a valid resource definition needs to meet the following rules. An invalid resource definition will cause the model deployment to fail.

-| Parameter | If necessary | Description |
+| Parameter | Required or optional | Description |
| --- | --- | --- |
-| `requests:`<br>`cpu:`| Required | String values, which can't be 0 or empty. <br>You can specify the CPU in millicores, for example `100m`, or in full numbers, for example `"1"` is equivalent to `1000m`.|
-| `requests:`<br>`memory:` | Required | String values, which can't be 0 or empty. <br>You can specify the memory as a full number + suffix, for example `1024Mi` for 1024 MiB. <br>Memory can't be less than **1 MBytes**.|
-| `limits:`<br>`cpu:` | Optional <br>(only required when need GPU) | String values, which can't be 0 or empty. <br>You can specify the CPU in millicores, for example `100m`, or in full numbers, for example `"1"` is equivalent to `1000m`. |
-| `limits:`<br>`memory:` | Optional <br>(only required when need GPU) | String values, which can't be 0 or empty. <br>You can specify the memory as a full number + suffix, for example `1024Mi` for 1024 MiB.|
-| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(only required when need GPU) | Integer values, which can't be empty and can only be specified in the `limits` section. <br>For more information, see the Kubernetes [documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If require CPU only, you can omit the entire `limits` section.|
-
-> [!NOTE]
-> If the resource section definition is invalid, the deployment will fail.
+| `requests:`<br>`cpu:`| Required | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`.|
+| `requests:`<br>`memory:` | Required | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1024 MiB. <br>Memory can't be less than 1 MB.|
+| `limits:`<br>`cpu:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the CPU in millicores; for example, `100m`. You can also specify it in full numbers. For example, `"1"` is equivalent to `1000m`. |
+| `limits:`<br>`memory:` | Optional <br>(required only when you need GPU) | String values, which can't be zero or empty. <br>You can specify the memory as a full number + suffix; for example, `1024Mi` for 1,024 MiB.|
+| `limits:`<br>`nvidia.com/gpu:` | Optional <br>(required only when you need GPU) | Integer values, which can't be empty and can be specified only in the `limits` section. <br>For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). <br>If you require CPU only, you can omit the entire `limits` section.|

-The `instance type` is *required* for model deployment. If you have defined the resource section, and it will be validated against the instance type, the rules are as follows:
+The instance type is *required* for model deployment. If you define the `resources` section, it's validated against the instance type according to the following rules:

-- With a valid resource section definition, the resource limits must be less than instance type limits, otherwise deployment will fail.
-- If the user does not define instance type, the `defaultinstancetype` will be used to be validated with resource section.
-- If the user does not define resource section, the instance type will be used to create deployment.
+- With a valid `resources` section definition, the resource limits must be less than the instance type limits. Otherwise, deployment will fail.
+- If you don't define an instance type, the system uses `defaultinstancetype` for validation with the `resources` section.
+- If you don't define the `resources` section, the system uses the instance type to create the deployment.
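
Here's a sketch that ties these rules together: a deployment `resources` section whose limits stay within the limits of the selected instance type. All values are illustrative:

```yaml
instance_type: <instance_type_name>
resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:                  # include the limits section only if you need GPU
    cpu: "1"
    memory: "2Gi"
    nvidia.com/gpu: 1
```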

## Next steps

- [Azure Machine Learning inference router and connectivity requirements](./how-to-kubernetes-inference-routing-azureml-fe.md)
-- [Secure AKS inferencing environment](./how-to-secure-kubernetes-inferencing-environment.md)
+- [Secure Azure Kubernetes Service inferencing environment](./how-to-secure-kubernetes-inferencing-environment.md)
