---
title: Azure Kubernetes Service AI toolchain operator add-on issues
description: Learn how to resolve issues that occur when you try to enable the Azure Kubernetes Service (AKS) AI toolchain operator add-on.
ms.date: 05/06/2025
author: sachidesai
ms.author: sachidesai
ms.service: azure-kubernetes-service
ms.custom: sap:Extensions, Policies and Add-Ons, references_regions
---

# AKS AI toolchain operator add-on issues

This article discusses how to troubleshoot problems that you might experience when you enable the Microsoft Azure Kubernetes Service (AKS) AI toolchain operator add-on during cluster creation or a cluster update.

## Prerequisites

- [Azure CLI](/cli/azure/install-azure-cli)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), the Kubernetes command-line client

## Symptoms

The AI toolchain operator add-on has two controllers: the `gpu-provisioner` controller and the `workspace` controller. After you enable the add-on and deploy a KAITO workspace, you might experience one or more of the following errors in your pod logs:

| Error message | Cause |
|--|--|
| Workspace was not created | [Cause 1: KAITO custom resource not configured properly](#cause-1-misconfiguration-in-kaito-custom-resource) |
| GPU node was not created | [Cause 2: GPU quota limitations](#cause-2-gpu-quota-limitations) |
| Resource ready condition is not `True` | [Cause 3: Long model inference image pull time](#cause-3-long-image-pull-time) |

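For example, the following sketch locates the two controllers and tails their logs. It assumes that the managed add-on deploys the controllers to the `kube-system` namespace with `kaito`-prefixed deployment names, which can vary by add-on version:

```bash
# Find the KAITO controller deployments that the add-on installed
kubectl get deployment -n kube-system | grep kaito

# Tail each controller's recent logs; adjust the deployment names if they
# differ in your cluster
kubectl logs -n kube-system deployment/kaito-gpu-provisioner --tail=50
kubectl logs -n kube-system deployment/kaito-workspace --tail=50
```
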
## Cause 1: Misconfiguration in KAITO custom resource

After you enable the add-on and deploy either a preset or a custom workspace custom resource (CR), a validation webhook in the `workspace` controller blocks common mistakes, such as invalid values in the CR spec. Check your `gpu-provisioner` and `workspace` pod logs to make sure that any updates to the GPU VM size meet the resource requirements of your model. After the workspace CR is created successfully, track the deployment progress by running the following commands:

```bash
kubectl get machine -o wide
```

```bash
kubectl get workspace -o wide
```

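For reference, a minimal preset workspace CR looks like the following sketch, based on the public KAITO falcon-7b example. The workspace name, `instanceType`, label, and preset name are example values; choose a GPU VM size that meets your model's requirements:

```bash
# Apply a minimal preset workspace CR (values shown are examples)
cat <<'EOF' | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"
EOF
```
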
## Cause 2: GPU quota limitations

The `gpu-provisioner` controller might have failed to create the GPU node(s). In this case, check the status of the machine CRs (internal CRs that the `workspace` controller creates) for error messages. Each machine CR has the label key `kaito.sh/workspace` with the workspace's name as the value, so you can filter on that label (see the sketch after this list). Address potential GPU quota limitations by:

1. Requesting an [increase in the subscription quota](/azure/quotas/quickstart-increase-quota-portal) for the GPU VM family that your deployment requires.

2. Checking [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table) in the specific region of your AKS cluster. You might need to switch to a different region or GPU VM size if the size you need isn't available.

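For example, you can inspect the machine CRs for a given workspace by that label and compare current usage against the regional quota before requesting an increase. The workspace name `workspace-falcon-7b`, region `eastus`, and VM family filter below are placeholders:

```bash
# Filter machine CRs by workspace label; provisioning errors appear in
# their status conditions
kubectl get machine -l kaito.sh/workspace=workspace-falcon-7b -o wide
kubectl describe machine -l kaito.sh/workspace=workspace-falcon-7b

# Check current usage against the vCPU quota for the GPU VM family in
# your cluster's region
az vm list-usage --location eastus --output table | grep -i "NCSv3"
```
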
## Cause 3: Long image pull time

The model inference image might not be pulled if the image access mode is set to private. This situation can occur for custom images for which you specify the image URL and a pull secret.

The inference images are usually large (30 GB to 100 GB), so expect a longer image pull time, up to tens of minutes, depending on your AKS cluster's networking setup.
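
To distinguish a slow pull from a failed private-image pull, check the pod events: a long pull shows `Pulling image` events that eventually succeed, while a missing or invalid pull secret surfaces as `ErrImagePull` or `ImagePullBackOff`. The pod name below is a placeholder:

```bash
# Look for ErrImagePull/ImagePullBackOff versus a still-running pull
kubectl get pods -o wide

# The Events section at the end of the output shows image pull progress
# and any registry authentication errors
kubectl describe pod <inference-pod-name>
```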
