Commit 93e8096 (1 parent fb5878c): editorial changes
---
title: Errors when enabling AKS AI toolchain operator add-on
description: Learn how to resolve errors that occur when you try to enable the Azure Kubernetes Service (AKS) AI toolchain operator add-on.
ms.date: 05/09/2025
ms.reviewer: sachidesai, v-weizhu
ms.service: azure-kubernetes-service
ms.custom: sap:Extensions, Policies and Add-Ons, references_regions
---
# Errors when enabling AKS AI toolchain operator add-on

This article provides guidance on resolving errors that might occur when you enable the Microsoft Azure Kubernetes Service (AKS) AI Toolchain Operator (KAITO) add-on during a cluster creation or update.

## Prerequisites

Ensure that the following tools are installed and configured. They're used in the following sections.

- [Azure CLI](/cli/azure/install-azure-cli)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), the Kubernetes command-line client

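You can quickly confirm that both tools are available on your machine before you continue:

```azurecli
# Verify that the Azure CLI and kubectl are installed and on your PATH
az --version
kubectl version --client
```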
## Symptoms

The KAITO add-on consists of two controllers: the `gpu-provisioner` controller and the `workspace` controller. After enabling the add-on and deploying a KAITO workspace, you might encounter one or more of the following errors in your pod logs:

| Error message | Cause |
| --- | --- |
| Workspace was not created | [Cause 1: KAITO custom resource not configured properly](#cause-1-misconfiguration-in-kaito-custom-resource) |
| GPU node was not created | [Cause 2: GPU quota limitations](#cause-2-gpu-quota-limitations) |
| Resource ready condition is not `True` | [Cause 3: Long pull time for model inference images](#cause-3-long-pull-time-for-model-inference-images) |

## Cause 1: Misconfiguration in KAITO custom resource

After you enable the add-on and deploy a preset or custom workspace custom resource (CR), the `workspace` controller includes a validation webhook. This webhook blocks common mistakes, such as setting incorrect values in the CR specification.

To resolve this issue, follow these steps:

1. Check your `gpu-provisioner` and `workspace` pod logs.
2. Ensure that any updates to the GPU virtual machine (VM) size meet the requirements of your model size.
3. After the workspace CR is created successfully, track the deployment progress by running the following commands:

    ```azurecli
    kubectl get machine -o wide
    ```

    ```azurecli
    kubectl get workspace -o wide
    ```

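For reference, a minimal workspace CR might look like the following sketch. The workspace name, instance type, and preset name are illustrative; adjust them to match your model's requirements:

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b        # hypothetical workspace name
resource:
  instanceType: "Standard_NC12s_v3" # GPU VM size; must meet the model's memory requirements
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"              # example preset; use the preset for your model
```

If a field in the specification is invalid, the validation webhook rejects the CR, and the rejection reason appears in the `workspace` pod logs.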
## Cause 2: GPU quota limitations

The `gpu-provisioner` controller might fail to create GPU nodes because of quota limitations in your subscription or region. In this case, you can check the machine CR status (an internal CR created by the `workspace` controller) for error messages. The machine CR created by the `workspace` controller has a `kaito.sh/workspace` label key with the workspace's name as the value.

To resolve this issue, use one of the following methods:

- Request an [increase in the subscription quota](/azure/quotas/quickstart-increase-quota-portal) for the required GPU VM family of your deployment.
- Check [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table?msockid=182ea2d5e1ff6eb61ccbb1b8e5ff608a) in the specific region of your AKS cluster.

If the required GPU VM size is unavailable in your current region, consider switching to a different region or selecting an alternative GPU VM size.

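For example, you can locate the machine CRs through their label and compare your current usage against the regional quota. The workspace name and region below are placeholders:

```azurecli
# Inspect the machine CRs created for a given workspace (name is illustrative)
WORKSPACE_NAME=workspace-falcon-7b
kubectl get machine -l kaito.sh/workspace=$WORKSPACE_NAME -o wide
kubectl describe machine -l kaito.sh/workspace=$WORKSPACE_NAME

# Check current vCPU usage against the compute quota in your cluster's region
az vm list-usage --location eastus --output table
```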
## Cause 3: Long pull time for model inference images

If the image access mode is set to private, the model inference image might not be pulled. This issue can occur for images for which you specify the URL and pull secret.

The inference images are typically large (30 GB to 100 GB), so a longer image pull time is expected. Depending on your AKS cluster's networking setup, the pull process might take up to tens of minutes.

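To check whether a slow or failing image pull is the problem, inspect the inference pod's events and confirm that any required pull secret exists. The pod and secret names below are placeholders:

```azurecli
# List the pods and show the inference pod's recent events, including image pull progress
kubectl get pods -o wide
kubectl describe pod workspace-falcon-7b-0 | grep -A 10 "Events:"

# If the image is private, verify that the referenced pull secret exists
kubectl get secret my-image-pull-secret -o yaml
```

An `ErrImagePull` or `ImagePullBackOff` status indicates a credential or URL problem, whereas a long-running `Pulling` event usually just means the large image is still downloading.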
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
