
Commit 8106e67

Merge pull request #3653 from MicrosoftDocs/main638832710815261913sync_temp
For protected branch, push strategy should use PR and merge to target branch method to work around git push error
2 parents 0c29beb + 48b6bc4 commit 8106e67

File tree: 5 files changed (+151, -133 lines)

AKS-Arc/deploy-ai-model.md

Lines changed: 131 additions & 125 deletions

description: Learn how to deploy an AI model on AKS Arc with the Kubernetes AI toolchain operator
author: sethmanheim
ms.author: sethm
ms.topic: how-to
ms.date: 05/19/2025
ms.reviewer: haojiehang
ms.lastreviewed: 05/14/2025

---

# Deploy an AI model on AKS Arc with the Kubernetes AI toolchain operator (preview)

[!INCLUDE [hci-applies-to-23h2](includes/hci-applies-to-23h2.md)]

This article describes how to deploy an AI model on AKS Arc with the *Kubernetes AI toolchain operator* (KAITO). The AI toolchain operator runs as a cluster extension in AKS Arc and makes it easier to deploy and run open-source LLMs on your AKS Arc cluster. To enable this feature, follow this workflow:

1. Create a cluster with KAITO.
1. Add a GPU node pool.
1. Deploy the model.
1. Validate the model with a test prompt.
1. Clean up resources.
1. Troubleshoot as needed.

> [!IMPORTANT]
> The KAITO extension for AKS on Azure Local is currently in PREVIEW.
> See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

## Prerequisites

Before you begin, make sure you have the following prerequisites:

- Make sure the Azure Local cluster has a supported GPU, such as A2, A16, or T4.
- Make sure the AKS Arc cluster can deploy GPU node pools with the corresponding GPU VM SKU. For more information, see [Use GPU for compute-intensive workloads](deploy-gpu-node-pool.md).
- Make sure that **kubectl** is installed on your local machine. If you need to install **kubectl**, see [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
- Install the **aksarc** extension, and make sure the version is at least 1.5.37. To get the list of installed CLI extensions, run `az extension list -o table` (see the sketch after this list).
- If you use a PowerShell terminal, make sure the version is at least 7.4.
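
For example, here's a minimal sketch of checking the CLI prerequisites; it assumes you manage the **aksarc** extension with the standard `az extension` commands:

```azurecli
# Check whether the aksarc extension is installed and report its version
az extension list --query "[?name=='aksarc'].version" -o tsv

# Install the aksarc extension, or upgrade it if it's already installed
az extension add --name aksarc --upgrade
```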

For all hosted model preset images and default resource configurations, see the [KAITO GitHub repository](https://github.com/kaito-project/kaito/tree/main/presets). All the preset models originate from Hugging Face, and we don't change the model behavior during redistribution. See the [content policy from Hugging Face](https://huggingface.co/content-policy).

The AI toolchain operator extension currently supports KAITO version 0.4.5. Keep this in mind when you choose a model from the KAITO model repository.

## Create a cluster with KAITO

To create an AKS Arc cluster on Azure Local with KAITO, follow these steps:

1. Gather [all required parameters](aks-create-clusters-cli.md) and include the `--enable-ai-toolchain-operator` parameter to enable KAITO as part of the cluster creation:

```azurecli
az aksarc create --resource-group <Resource_Group_name> --name <Cluster_Name> --custom-location <Custom_Location_Name> --vnet-ids <VNet_ID> --enable-ai-toolchain-operator
```

1. After the command succeeds, make sure the KAITO extension is installed correctly and that the KAITO operator pods in the `kaito` namespace are in a running state, as shown in the following sketch.
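
One way to verify this is a quick check with the Azure CLI and kubectl. This is a minimal sketch; it assumes the cluster is Arc-connected (`connectedClusters`), and the exact extension name shown in the list can vary:

```azurecli
# Confirm the AI toolchain operator extension shows a Succeeded provisioning state
az k8s-extension list --resource-group <Resource_Group_name> --cluster-name <Cluster_Name> --cluster-type connectedClusters -o table

# Confirm the KAITO operator pods are running
kubectl get pods -n kaito
```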

## Update an existing cluster with KAITO

If you want to enable KAITO on an existing AKS Arc cluster that has a GPU node pool, run the following command to install the KAITO operator on the cluster:

```azurecli
az aksarc update --resource-group <Resource_Group_name> --name <Cluster_Name> --enable-ai-toolchain-operator
```

## Add a GPU node pool

1. Before you add a GPU node pool, make sure that Azure Local is enabled with a supported GPU such as A2, T4, or A16, and that the GPU drivers are installed on all the host nodes. To add a GPU node pool, follow these steps:

### [Azure portal](#tab/portal)

Sign in to the Azure portal and find your AKS Arc cluster. Under **Settings > Node pools**, select **Add**. Fill in the other required fields, then create the node pool.

:::image type="content" source="media/deploy-ai-model/add-gpu-node-pool.png" alt-text="Screenshot of portal showing add GPU node pool." lightbox="media/deploy-ai-model/add-gpu-node-pool.png":::

### [Azure CLI](#tab/azurecli)

To create a GPU node pool using the Azure CLI, run the following command. The GPU VM SKU used in the following example is for the **A16** model. For the full list of VM SKUs, see [Supported VM sizes](deploy-gpu-node-pool.md#supported-gpu-vm-sizes).

```azurecli
az aksarc nodepool add --name "samplenodepool" --cluster-name "samplecluster" --resource-group "sample-rg" --node-vm-size "Standard_NC16_A16" --os-type "Linux"
```

---

2. After the node pool is provisioned, you can confirm whether the node was successfully provisioned by filtering on the node pool name:

```bash
kubectl get nodes --show-labels | grep "msft.microsoft/nodepool-name=.*<Node_Pool_Name>" | awk '{print $1}'
```

For PowerShell, you can use the following command:

```powershell
kubectl get nodes --show-labels | Select-String "msft.microsoft/nodepool-name=.*<Node_Pool_Name>" | ForEach-Object { ($_ -split '\s+')[0] }
```

3. Label the newly provisioned GPU node so that the inference workspace can be deployed to it in the next step. You can confirm that the label is applied; a quick check is shown after the labeling command.

```powershell
kubectl label node moc-l36c6vu97d5 apps=llm-inference
```
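
A quick check, as a minimal sketch using the `apps=llm-inference` label from the previous command:

```powershell
# Only nodes carrying the label are returned
kubectl get nodes -l apps=llm-inference --show-labels
```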

## Deploy the model

To deploy the AI model, follow these steps:

1. Create a YAML file using the following sample. In this example, we use the Phi 3.5 Mini model by specifying the preset name **phi-3.5-mini-instruct**. If you want to use a different LLM, use its preset name from the KAITO repo. Also make sure that the LLM can be deployed on your VM SKU, based on the table in the "Model VM SKU Matrix" section.

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-llm
resource:
  instanceType: <GPU_VM_SKU> # Update this value with the GPU VM SKU
  labelSelector:
    matchLabels:
      apps: llm-inference
  preferredNodes:
    - moc-l36c6vu97d5 # Update this value with the GPU node name
inference:
  preset:
    name: phi-3.5-mini-instruct # Update the preset name as needed
  config: "ds-inference-params"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ds-inference-params
data:
  inference_config.yaml: |
    max_probe_steps: 6 # Maximum number of steps to find the max available seq len fitting in the GPU memory.
    vllm:
      cpu-offload-gb: 0
      swap-space: 4
      gpu-memory-utilization: 0.9
      max-model-len: 4096
```

1. Apply the YAML file and wait until the deployment completes. Make sure you have good internet connectivity so that the model can be downloaded from the Hugging Face website within a few minutes. When the inference workspace is successfully provisioned, both **ResourceReady** and **InferenceReady** become **True**. See the "Troubleshooting" section if you encounter any failures in the workspace deployment.

```bash
kubectl apply -f sampleyamlfile.yaml
```

1. Validate that the workspace deployment succeeded:

```bash
kubectl get workspace -A
```
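
To watch the readiness conditions directly, you can query the workspace status. This is a minimal sketch; it assumes the workspace is named `workspace-llm` and lives in the default namespace:

```bash
# Print each condition type with its status, for example ResourceReady=True
kubectl get workspace workspace-llm -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```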

## Validate the model with a test prompt

After the resource and inference states become ready, the inference service is exposed internally via a cluster IP. You can test the model with the following prompt:

```bash
export CLUSTERIP=$(kubectl get svc workspace-llm -o jsonpath="{.spec.clusterIPs[0]}")

kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3.5-mini-instruct",
        "prompt": "What is kubernetes?",
        "max_tokens": 20,
        "temperature": 0
      }'
```

```powershell
$CLUSTERIP = $(kubectl get svc workspace-llm -o jsonpath="{.spec.clusterIPs[0]}")
$jsonContent = '{
  "model": "phi-3.5-mini-instruct",
  "prompt": "What is kubernetes?",
  "max_tokens": 20,
  "temperature": 0
}'

kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/v1/completions -H "accept: application/json" -H "Content-Type: application/json" -d $jsonContent
```
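
If you prefer to call the endpoint from your local machine instead of a temporary curl pod, a port-forward also works. This is a minimal sketch; it assumes the `workspace-llm` service listens on port 80, as the URLs above imply:

```bash
# Forward the inference service to localhost, then query it from another terminal
kubectl port-forward svc/workspace-llm 8080:80

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3.5-mini-instruct", "prompt": "What is kubernetes?", "max_tokens": 20, "temperature": 0}'
```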

## Clean up resources

To clean up the resources, remove both the inference workspace and the extension:

```azurecli
kubectl delete workspace workspace-llm

az aksarc update --resource-group <Resource_Group_name> --name <Cluster_Name> --disable-ai-toolchain-operator
```
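
As a minimal sketch, you can confirm the cleanup by checking that the workspace and its service are gone:

```bash
# Both commands should return no workspace-llm resources after cleanup
kubectl get workspace -A
kubectl get svc workspace-llm
```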

## Model VM SKU Matrix

The following table shows which model presets are supported on each GPU and VM SKU. The GPU model determines the VM SKU to use when you create a node pool. For more information about the supported GPUs, see [Supported GPU models](scale-requirements.md#supported-gpu-models).

195-
## Troubleshooting
194+
| Type | T4 | A2 or A16 | A2 or A16 |
195+
|-------------------------------------|---------------------|-----------------------------------|-------------------------------------|
196+
| Model VM SKU Matrix | Standard_NK6 | Standard_NC4, Standard_NC8 | Standard_NC32, Standard_NC16 |
197+
| phi-3-mini-4k-instruct | Y | Y | Y |
198+
| phi-3-mini-128k-instruct | N | Y | Y |
199+
| phi-3.5-mini-instruct | N | Y | Y |
200+
| phi-4-mini-instruct | N | N | Y |
201+
| deepseek-r1-distill-llama-8b | N | N | Y |
202+
| mistral-7b/mistral-7b-instruct | N | N | Y |
203+
| qwen2.5-coder-7b-instruct | N | N | Y |
196204

197-
If the pod is not deployed properly or the **ResourceReady** field shows empty or **false**, it's usually because the preferred GPU node isn't labeled correctly. Check the node label with `kubectl get node <yourNodeName> --show-labels`. For example, in the YAML file, the following code specifies that the node must have the label `apps=llm-inference`:
205+
## Troubleshooting
198206

199-
```yaml
200-
labelSelector:
201-
matchLabels:
202-
apps: llm-inference
203-
```
207+
1. If you want to deploy an LLM and see the error **OutOfMemoryError: CUDA out of memory**, please raise an issue in the [KAITO repo](https://github.com/kaito-project/kaito/).
208+
1. If you see the error **(ExtensionOperationFailed) The extension operation failed with the following error: Unable to get a response from the Agent in time** during extension installation, [see this TSG](/troubleshoot/azure/azure-kubernetes/extensions/cluster-extension-deployment-errors#error-unable-to-get-a-response-from-the-agent-in-time) and ensure the extension agent in the AKS Arc cluster can connect to Azure.
209+
1. If you see an error during prompt testing such as **{"detail":[{"type":"json_invalid","loc":["body",1],"msg":"JSON decode error","input":{},"ctx":{"error":"Expecting property name enclosed in double quotes"}}]}**, it's possible that your PowerShell terminal version is 5.1. Make sure the terminal version is at least 7.4.
204210

205211
## Next steps
206212

238 KB
Loading

0 commit comments

Comments
 (0)