
Commit f2e6a47

Merge pull request #39 from oracle-quickstart/installing_onto_existing_cluster
Installing OCI AI Blueprints onto existing cluster
2 parents c36bafe + 78a3a42 commit f2e6a47

File tree

5 files changed: +269 -2 lines changed
INSTALLING_ONTO_EXISTING_CLUSTER_README.md

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
# Install OCI AI Blueprints onto an Existing OKE Cluster

This guide helps you install and use **OCI AI Blueprints** for the first time on an existing OKE cluster that was created outside of blueprints and already has workloads running on it. You will:

1. Ensure you have the correct IAM policies in place.
2. Retrieve the existing OKE cluster and VCN names from the console.
3. Deploy the **OCI AI Blueprints** application onto the existing cluster.
4. Learn how to add existing nodes in the cluster for use by blueprints.
5. Deploy a sample recipe to that node.
6. Test your deployment and undeploy it.
7. Destroy the stack.

If you already have the nvidia-gpu-operator installed and would like to use Multi-Instance GPUs with H100 nodes, see [this section](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#multi-instance-gpu-setup) at the bottom of this guide.

Additionally, visit [this section](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#need-help) if you need to contact the team about setup issues.

---
## Overview

Rather than installing blueprints onto a new cluster, a user may want to leverage an existing cluster with node pools and tools already installed. This doc covers the components needed to deploy to an existing cluster, how to add existing node pools for use by blueprints, and additional points to consider.

## Step 1: Set Up Policies in Your Tenancy

Some or all of these policies may already be in place as required by OKE. Please review the required policies listed [here](docs/iam_policies/README.md) and add any that are missing.

1. If you are **not** a tenancy administrator, ask your admin to add the additional required policies in the **root compartment**.
2. If you **are** a tenancy administrator, you can either manually add the additional policies to an existing dynamic group, or let the resource manager deploy the required policies during stack creation.
## Step 2: Retrieve the Existing OKE Cluster and VCN Names

1. Navigate to the console.
2. Go to the region that contains the cluster you wish to deploy blueprints onto.
3. Navigate to **Developer Services -> Containers & Artifacts -> Kubernetes Clusters (OKE) -> YourCluster**.
4. Click your cluster, then capture the name of the cluster and the name of the VCN, as they will be used during stack creation. (An optional CLI alternative is sketched below.)
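If you prefer the command line, the same information can be retrieved with the OCI CLI. This is a minimal sketch, assuming the CLI is configured and `<compartment-ocid>` / `<vcn-ocid>` are replaced with your own values:

```bash
# List OKE clusters in the compartment, showing each cluster's name and its VCN OCID
oci ce cluster list \
  --compartment-id <compartment-ocid> \
  --query 'data[*].{name: name, "vcn-id": "vcn-id"}' \
  --output table

# Resolve the VCN OCID from above to its display name
oci network vcn get --vcn-id <vcn-ocid> --query 'data."display-name"'
```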
## Step 3: Deploy the OCI AI Blueprints Application

1. Go to [Deploy the OCI AI Blueprints Application](./GETTING_STARTED_README.md#step-3-deploy-the-oci-ai-blueprints-application) and click the button to deploy.
2. Go to the correct region where your cluster is deployed.
3. If you have not created the policies, are an admin, and would like the stack to deploy the policies for you:

   - Select "NO" for the question "Have you enabled the required policies in the root tenancy for OCI AI Blueprints?"
   - Select "YES" for the question "Are you an administrator for this tenancy?"
   - Under the section "OCI AI Blueprints IAM", click the checkbox to create the policies. (If you do not see this, ensure you've selected the correct choices for the questions above.)

   - Otherwise, create the policies if you are an admin, or have your admin create the policies.

4. Select "YES" for all other options.
5. Fill out the additional fields for username and password, as well as Home Region.
6. Under "OKE Cluster & VCN", select the cluster name and VCN name you found in Step 2.
7. Populate the subnets with the appropriate values. Note that there is a "hint" under each field which corresponds to possible naming conventions. If your subnets are named differently, navigate back to the console page for the cluster and find them there.
8. **Important**: uncheck any boxes for "add-ons" which you already have installed. The stack will fail if a box is left checked and you already have the tool installed in any namespace. (A quick way to check for existing installs is sketched after this list.)
   - If you leave a box checked and the stack fails:
     - Click on stack details at the top
     - Click on "Variables"
     - Click the "Edit variables" box
     - Click "Next"
     - Fill the drop-downs back in at the top (the rest will persist)
     - Uncheck the box of the previously installed application
     - Click "Next"
     - Check the "Run apply" box
     - Click "Save changes"
   - **Currently, autoscaling requires Prometheus to be installed in the `cluster-tools` namespace** and KEDA in the `default` namespace. This will change in an upcoming release.
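Before launching the stack, you can confirm which of these tools are already installed and in which namespaces. A minimal check, assuming `kubectl` and `helm` access to the cluster are already configured (the grep patterns are only illustrative):

```bash
# List every Helm release on the cluster and its namespace
helm list -A

# Check for an existing Prometheus install (autoscaling currently expects it in cluster-tools)
kubectl get pods -n cluster-tools | grep -i prometheus

# Check for an existing KEDA install (currently expected in the default namespace)
kubectl get pods -A | grep -i keda
```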
## Step 4: Add Existing Nodes to Cluster (optional)

If you have existing node pools in your original OKE cluster that you'd like Blueprints to be able to use, follow these steps after the stack is finished:

1. Find the private IP address of the node you'd like to add.
   - Console:
     - Go to the OKE cluster in the console as you did above
     - Click on "Node pools"
     - Click on the pool with the node you want to add
     - Identify the private IP address of the node under "Nodes" on the page
   - Command line with `kubectl` (assumes cluster access is set up):
     - Run `kubectl get nodes`
     - Run `kubectl describe node <nodename>` on each node until you find the node you want to add
     - The private IP appears under the `Name` field of the output of `kubectl get nodes`
2. Go to the stack and click "Application information". Click the API URL.
   - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
3. Log in with the `Admin Username` and `Admin Password` in the Application information tab.
4. Click the link next to "deployment", which will take you to a page with a "Deployment List" and a content box.
5. Paste in the sample blueprint JSON found [here](./docs/sample_blueprints/add_node_to_control_plane.json). (It is also reproduced after this list.)
6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
7. Click "POST". This is a fast operation.
8. Wait about 20 seconds and refresh the page. It should look like:

```json
[
  {
    "mode": "update",
    "recipe_id": null,
    "creation_date": "2025-03-28 11:12 AM UTC",
    "deployment_uuid": "750a________cc0bfd",
    "deployment_name": "startupaddnode",
    "deployment_status": "completed",
    "deployment_directive": "commission"
  }
]
```
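For reference, the add-node blueprint from `docs/sample_blueprints/add_node_to_control_plane.json` is shown below. Only `recipe_node_name` needs to be changed to your node's private IP; the labels shown are the sample's defaults:

```json
{
  "recipe_mode": "update",
  "deployment_name": "startupaddnode",
  "recipe_node_name": "10.0.10.164",
  "recipe_node_labels": {
    "corrino": "a10pool",
    "corrino/pool-shared-any": "true"
  }
}
```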
## Step 5: Deploy a Sample Recipe

1. Go to the stack and click "Application information". Click the API URL.
   - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
2. Log in with the `Admin Username` and `Admin Password` in the Application information tab.
3. Click the link next to "deployment", which will take you to a page with a "Deployment List" and a content box.
4. If you added a node in [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](./docs/sample_blueprints/vllm_inference_sample_shared_pool_blueprint.json).
   - Depending on the node shape, you will need to change `"recipe_node_shape": "BM.GPU.A10.4"` to match your shape. (The key fields to adjust are shown after this list.)
5. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](./docs/sample_blueprints/vllm_inference_sample_blueprint.json).
6. Paste the blueprint you selected into the content box on the deployment page and click "POST".
7. To monitor the deployment, go back to "Api Root" and click "deployment_logs".
   - If you are deploying without a shared node pool, it can take 10-30 minutes to bring up a node, depending on the shape and whether it is bare metal or virtual.
   - If you are deploying with a shared node pool, the blueprint will deploy much more quickly.
   - It is common for a recipe to report "unhealthy" while it is deploying. This is caused by "Warnings" in the pod events when deploying to Kubernetes. You only need to be concerned when an "error" is reported.
8. Wait for the following steps to complete:
   - Affinity / selection of node -> Directive / commission -> Command / initializing -> Canonical / name assignment -> Service -> Deployment -> Ingress -> Monitor / nominal.
9. When you see the step "Monitor / nominal", you have an inference server running on your node.
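The fields you are most likely to adjust in the shared node pool blueprint are excerpted below (values are the sample's defaults for a `BM.GPU.A10.4` node; keep the shape, GPU count, and `tensor_parallel_size` consistent with your hardware):

```json
{
  "recipe_node_shape": "BM.GPU.A10.4",
  "recipe_use_shared_node_pool": true,
  "recipe_nvidia_gpu_count": 2,
  "recipe_container_env": [
    { "key": "tensor_parallel_size", "value": "2" }
  ]
}
```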
## Step 6: Test Your Deployment

1. Upon completion of [Step 5](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-5-deploy-a-sample-recipe), test the deployment endpoint.
2. Go to "Api Root", then click "deployment_digests". Find the "service_endpoint_domain" on this page.
   - This is `<deployment-name>.<base-url>.nip.io` for those who let us deploy the endpoint. If you use the default recipes above, an example of this would be:

   `vllm-inference-deployment.158-179-30-233.nip.io`

3. `curl` the metrics endpoint:

```bash
curl -L vllm-inference-deployment.158-179-30-233.nip.io/metrics

# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="4096",num_gpu_blocks="10947",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="/models/NousResearch/Meta-Llama-3.1-8B-Instruct"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
...
```

4. Send an actual POST request:

```bash
curl -L -H "Content-Type: application/json" -d '{"model": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello, how are you?"}], "temperature": 0.7, "max_tokens": 100 }' vllm-inference-deployment.158-179-30-233.nip.io/v1/chat/completions | jq

# response
{
  "id": "chatcmpl-bb9093a3f51cee3e0ebe67ed06da59f0",
  "object": "chat.completion",
  "created": 1743169357,
  "model": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! I'm a helpful assistant, so I'm always ready to assist you with any questions or tasks you may have. How about you? How's your day going so far?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "total_tokens": 73,
    "completion_tokens": 46,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
```

5. When completed, undeploy the recipe (an optional `curl` equivalent is sketched after this list):
   - Go to Api Root -> deployment
   - Grab the whole deployment_uuid field for your deployment.
     - "deployment_uuid": "asdfjklafjdskl"
   - Go to Api Root -> undeploy
   - Paste the "deployment_uuid" field into the content box and wrap it in curly braces {}:
     - {"deployment_uuid": "asdfjklafjdskl"}
   - Click "POST"
6. Monitor the undeploy:
   - Go to Api Root -> deployment_logs
   - Look for: Directive decommission -> Ingress deleted -> Deployment deleted -> Service deleted -> Directive / decommission / completed.
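For reference, the same undeploy can be issued from a terminal. This is only a sketch with assumptions: `<api-url>` is the API URL from the Application information tab, the API accepts HTTP Basic authentication with the admin credentials, and `<deployment-uuid>` is the value copied above:

```bash
# Issue the same undeploy request shown above via curl (assumes Basic auth is accepted)
curl -L -u "<admin-username>:<admin-password>" \
  -H "Content-Type: application/json" \
  -d '{"deployment_uuid": "<deployment-uuid>"}' \
  <api-url>/undeploy/
```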
## Step 7: Destroy the Stack

Destroying the OCI AI Blueprints stack will not destroy any resources which were created outside of the stack, such as node pools or Helm installs. Only resources created by the stack will be destroyed. To destroy the stack:

1. Go to the console and navigate to Developer Services -> Resource Manager -> Stacks -> Your OCI AI Blueprints stack
2. Click "Destroy" at the top

## Multi-Instance GPU Setup

If you already have the NVIDIA GPU operator installed and would like to reconfigure it because you plan on using Multi-Instance GPUs (MIG) with your H100 nodes, you will need to manually update / reconfigure your cluster with Helm.

This can be done as shown below:

```bash
# Get the release name of the existing GPU operator install
helm list -n gpu-operator

NAME                     NAMESPACE     REVISION  UPDATED                               STATUS    CHART                 APP VERSION
gpu-operator-1742982512  gpu-operator  1         2025-03-26 05:48:41.913183 -0400 EDT  deployed  gpu-operator-v24.9.2  v24.9.2

# Upgrade the release to enable the MIG manager with the "mixed" MIG strategy
helm upgrade gpu-operator-1742982512 nvidia/gpu-operator \
  --namespace gpu-operator \
  --set mig.strategy="mixed" \
  --set migManager.enabled=true

Release "gpu-operator-1742982512" has been upgraded. Happy Helming!
NAME: gpu-operator-1742982512
LAST DEPLOYED: Wed Mar 26 05:59:23 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 2
TEST SUITE: None
```
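Once the MIG manager is enabled, MIG profiles are typically requested per node through the `nvidia.com/mig.config` node label. A minimal sketch follows; the profile name `all-1g.10gb` is one example from the GPU operator's default mig-parted configuration, so check the config map in your cluster for the exact profile names available for your GPUs:

```bash
# Ask the MIG manager to partition every GPU on this node into 1g.10gb instances
kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite

# After reconfiguration completes, the MIG resources should be advertised on the node
kubectl describe node <node-name> | grep nvidia.com/mig
```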
## Need Help?

- Check out [Known Issues & Solutions](docs/known_issues/README.md) for troubleshooting common problems.
- For questions or additional support, contact [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).

README.md

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ Looking to install and use OCI AI Blueprints right away? **[Click here](./GETTIN

We recommend following the Getting Started guide if this is your first time.
+ If you are looking to install OCI AI Blueprints onto an existing OKE cluster which already has running workloads and node pools, visit [this doc](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md).
---

## Introduction
docs/sample_blueprints/add_node_to_control_plane.json

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
{
  "recipe_mode": "update",
  "deployment_name": "startupaddnode",
  "recipe_node_name": "10.0.10.164",
  "recipe_node_labels": {
    "corrino": "a10pool",
    "corrino/pool-shared-any": "true"
  }
}

docs/sample_blueprints/vllm_inference_sample_blueprint.json

Lines changed: 2 additions & 2 deletions

@@ -2,7 +2,7 @@
  "recipe_id": "llm_inference_nvidia",
  "recipe_mode": "service",
  "deployment_name": "vLLM Inference Deployment",
- "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:vllmv0.6.2",
+ "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:vllmv0.6.6.pos1",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
@@ -38,5 +38,5 @@
    "$(tensor_parallel_size)"
  ],
  "recipe_ephemeral_storage_size": 100,
- "recipe_shared_memory_volume_size_limit_in_mb": 200
+ "recipe_shared_memory_volume_size_limit_in_mb": 1000
}
docs/sample_blueprints/vllm_inference_sample_shared_pool_blueprint.json

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
{
  "recipe_id": "llm_inference_nvidia",
  "recipe_mode": "service",
  "deployment_name": "vLLM Inference Deployment",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:vllmv0.6.6.pos1",
  "recipe_node_shape": "BM.GPU.A10.4",
  "input_object_storage": [
    {
      "par": "https://objectstorage.us-ashburn-1.oraclecloud.com/p/IFknABDAjiiF5LATogUbRCcVQ9KL6aFUC1j-P5NSeUcaB2lntXLaR935rxa-E-u1/n/iduyx1qnmway/b/corrino_hf_oss_models/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": ["NousResearch/Meta-Llama-3.1-8B-Instruct"]
    }
  ],
  "recipe_container_env": [
    {
      "key": "tensor_parallel_size",
      "value": "2"
    },
    {
      "key": "model_name",
      "value": "NousResearch/Meta-Llama-3.1-8B-Instruct"
    },
    {
      "key": "Model_Path",
      "value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"
    }
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_use_shared_node_pool": true,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_container_command_args": [
    "--model",
    "$(Model_Path)",
    "--tensor-parallel-size",
    "$(tensor_parallel_size)"
  ],
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000
}
