# Overview
This repository intends to demonstrate how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Kubernetes Engine (OKE) with the TensorRT-LLM backend and Triton Inference Server in order to serve Large Language Models (LLMs) in a Kubernetes architecture. The model used is Llama2-7B-chat on an A10 GPU. For scalability, we are hosting the model repository in a bucket in Oracle Cloud Object Storage.
# Pre-requisites
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
* Your instance is able to authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
* You have access to NVIDIA AI Enterprise to pull the NIM containers.
* You are familiar with basic Kubernetes and Helm terminology.
* You have a HuggingFace account with an Access Token configured to download Llama2-7B-chat.
# Walkthrough
Start a VM.GPU.A10.1 from the Instance > Compute menu with the NGC image.
## Update to the required NVIDIA Drivers (Optional)
It is recommended to update your drivers to the latest available version, following NVIDIA's guidance and the [compatibility matrix between the drivers and your CUDA version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). See also https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network for more information.
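To check what is currently installed, and as a rough sketch of the update flow on Ubuntu 20.04 (the `cuda-keyring` package and the `cuda-drivers` meta-package below follow NVIDIA's generic network-repository instructions and are assumptions, not part of this repository):

```
# Check the currently installed driver and the highest CUDA version it supports
nvidia-smi

# Assumed update flow via the CUDA network repository (adapt the keyring version if needed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-drivers
```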
Copy the file [`model_config.yaml`](model_config.yaml) and create the directory to host the model store. This is where the Model Repo Generator command will store the output.
```
mkdir model-store
chmod -R 777 model-store
docker run --rm -it --gpus all -v $(pwd)/model-store:/model-store -v $(pwd)/model_config.yaml:/model_config.yaml -v $(pwd)/Llama-2-7b-chat-hf:/engine_dir nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.02.rc4 bash -c "model_repo_generator llm --verbose --yaml_config_file=/model_config.yaml"
```
### Export the model repository to an Oracle Cloud Object Storage Bucket
At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded under `NIM/llama2-7b-hf` (in case we upload different model configurations to the same bucket):
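A minimal sketch of that bulk upload, run from inside the `model-store` directory and assuming instance principal authentication:

```
cd model-store
# Upload the generated model repository under the llama2-7b-hf/ prefix of the NIM bucket
oci os object bulk-upload -bn NIM --src-dir . --prefix llama2-7b-hf/ --auth instance_principal
```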
At this stage, the model repository is uploaded to an OCI bucket. It is a good moment to test the setup.
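For example, you can verify the upload by listing the objects under the prefix (again assuming instance principal authentication):

```
# List the uploaded model repository files
oci os object list -bn NIM --prefix llama2-7b-hf/ --auth instance_principal
```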
> [!IMPORTANT]
> Because the `--model-repository` parameter is currently hardcoded in the container, we cannot simply point to the bucket when we start it. One option would be to adapt the Python script within the container, but we would need sudo privileges. The other would be to mount the bucket as a file system on the machine directly. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu you can run `sudo apt install fuse3 jq`.
Start by gathering your Namespace, Compartment OCID and Region, either by fetching them from the web console or by running the following commands from your compute instance:
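For example, assuming `jq` is installed and the instance can authenticate via instance principal, `oci-cli` and the instance metadata service provide all three values:

```
# Object Storage namespace of the tenancy
oci os ns get --auth instance_principal

# Compartment OCID and region from the instance metadata service
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.compartmentId'
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.canonicalRegionName'
```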
> Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using Docker, but because it needs the `--device` option to use `fuse`, it is complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a feature request from 2015 that is still very active as of March 2024). The workaround I chose is to set up rclone as a service on the host and mount the bucket on startup.
In [cloud-init](cloud-init), replace the values of your Namespace, Compartment OCID and Region on lines 17, 18 and 19 with the values retrieved previously. You can also adapt the name of the bucket on line 57. By default it is called `NIM` and has a directory called `llama2-7b-hf`.
This cloud-init script will be uploaded to your GPU node in your OKE cluster. The first part consists of increasing the boot volume to the value set. Then, it downloads rclone, creates the correct directories and creates the configuration file, the same way as we did previously on the GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
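For reference, the rclone remote and mount that the script sets up roughly follow the shape below; the remote name, option values and placeholders are assumptions based on rclone's `oracleobjectstorage` backend rather than a copy of the actual script:

```
# Assumed rclone remote definition for OCI Object Storage using instance principals
mkdir -p /etc/rclone /opt/mnt/model_bucket_oci
cat > /etc/rclone/rclone.conf <<EOF
[model_bucket_oci]
type = oracleobjectstorage
provider = instance_principal_auth
namespace = <your-namespace>
compartment = <your-compartment-ocid>
region = <your-region>
EOF

# Mount the NIM bucket where the inference server expects the model store
# (--allow-other requires 'user_allow_other' in /etc/fuse.conf)
rclone --config /etc/rclone/rclone.conf mount model_bucket_oci:NIM /opt/mnt/model_bucket_oci \
  --allow-other --daemon
```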
## Deploy on OKE
It is now time to bring everything together in Oracle Kubernetes Engine (OKE).
Start by creating an OKE Cluster following [this tutorial](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingclusterusingoke_topic-Using_the_Console_to_create_a_Quick_Cluster_with_Default_Settings.htm) with slight adaptations:
* Start by creating a CPU node pool with only 1 node (e.g. VM.Standard.E4.Flex with 5 OCPUs and 80 GB RAM) using the default image; it will be used for monitoring.
* Once your cluster is up, create another node pool with 1 GPU node (e.g. VM.GPU.A10.1) using the default image that comes with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init).
### Deploy using Helm in Cloud Shell
#### Adapting the variables
You can find the Helm configuration in the folder [`oke`](oke/) where you need to adapt the [`values.yaml`](oke/values.yaml):
Review your credentials for the [secret to pull the image](https://helm.sh/docs/howto/charts_tips_and_tricks/#creating-image-pull-secrets) in [`values.yaml`](oke/values.yaml):
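The pull secret boils down to your OCIR endpoint, a username of the form `<namespace>/<username>` and your Auth Token as the password. As a sanity check you can validate these credentials outside of Helm with `kubectl` (the secret name below is hypothetical):

```
# Hypothetical secret name; replace the region key, namespace, user, token and email with your own values
kubectl create secret docker-registry ocir-pull-secret \
  --docker-server=<region-key>.ocir.io \
  --docker-username='<namespace>/<username>' \
  --docker-password='<auth-token>' \
  --docker-email='<email>'
```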
The load balancer created by default comes with a fixed shape and a bandwidth of 100 Mbps. You can switch to a [flexible](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingloadbalancers-subtopic.htm#contengcreatingloadbalancers_subtopic) shape and adapt the bandwidth according to your OCI limits in case the bandwidth becomes a bottleneck.
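One possible way to request a flexible shape is through the standard OCI load balancer annotations on the inference Service; the Service name below is hypothetical and the bandwidth values are only examples:

```
# Hypothetical Service name; the annotations are the standard OCI load balancer ones
kubectl annotate service triton-inference-server \
  service.beta.kubernetes.io/oci-load-balancer-shape=flexible \
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-min=10 \
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-max=100 \
  --overwrite
```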
An example Grafana dashboard is available in [dashboard-review.json](oke/dashboard-review.json). Use the import function in Grafana to import and view this dashboard.
You can then see the public IP of your Grafana dashboard by running:
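A simple sketch, assuming the Grafana Service is exposed as a LoadBalancer:

```
# Look for the EXTERNAL-IP of the Grafana LoadBalancer service
kubectl get svc --all-namespaces | grep -i grafana
```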
Resources:
* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
* [NGC page with all versions of NVIDIA Triton Inference Server](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags)
* [NIM documentation on how to use non-prebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)