
Commit e403640

Update README.md
1 parent 0dee91b commit e403640

File tree

  • cloud-infrastructure/ai-infra-gpu/GPU/nim-gpu-oke

1 file changed: +15 -15 lines changed

cloud-infrastructure/ai-infra-gpu/GPU/nim-gpu-oke/README.md

Lines changed: 15 additions & 15 deletions
@@ -1,6 +1,6 @@
11
# Overview
22

3-
This repository intends to demonstrate how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Kubernetes Engine (OKE) with TensorRT-LLM Backend and Triton Inference Server in order to server Large Language Models (LLM's) in a Kubernetes architecture. The model used is Llama2-7B-chat on a GPU A10. For scalability, we are hosting the model repository on a Bucket in Oracle Object Storage.
3+
This repository demonstrates how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Kubernetes Engine (OKE) with the TensorRT-LLM backend and Triton Inference Server in order to serve Large Language Models (LLMs) in a Kubernetes architecture. The model used is Llama2-7B-chat on an A10 GPU. For scalability, we host the model repository in a bucket in Oracle Cloud Object Storage.
44

55
# Pre-requisites
66

@@ -9,9 +9,9 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
99
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
1010
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
1111
* Your instance can authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
12-
* You have access to NVIDIA AI Entreprise to pull the containers.
12+
* You have access to NVIDIA AI Enterprise to pull the NIM containers.
1313
* You are familiar with Kubernetes and Helm basic terminology.
14-
* A HuggingFace account with an Access Token configured to download llama2-7B-chat
14+
* You have a HuggingFace account with an Access Token configured to download Llama2-7B-chat.
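For reference, a sketch of the registry logins these prerequisites imply; the OCIR region key, tenancy namespace and username are placeholders, and the NGC API key and OCI Auth Token are entered as passwords when prompted:

```
# Illustrative only -- log in to NGC (username is the literal string $oauthtoken,
# password is your NGC API key) and to OCIR (password is your OCI Auth Token).
docker login nvcr.io -u '$oauthtoken'
docker login <region-key>.ocir.io -u '<tenancy-namespace>/<username>'
```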
1515

1616
# Walkthrough
1717

@@ -24,7 +24,7 @@ Start a VM.GPU.A10.1 from the Instance > Compute menu with the [NGC image](https
2424

2525
## Update to the required NVIDIA Drivers (Optional)
2626

27-
It is recommended to update your drivers to the latest available following the guidance from NVIDIA with the compatibility [matrix between the drivers and your Cuda version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). You can also see See https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network for more information.
27+
It is recommended to update your drivers to the latest available, following NVIDIA's guidance and the compatibility [matrix between the drivers and your CUDA version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). See also https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network for more information.
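You can check the currently installed driver and the highest CUDA version it supports before and after the upgrade:

```
# Reports the installed NVIDIA driver version and the maximum supported CUDA version.
nvidia-smi
```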
2828

2929

3030
```
@@ -75,7 +75,7 @@ git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
7575

7676
### Create the Model Config
7777

78-
Copy the file `model_config.yaml` and create the directory to host the model store. This is where the Model Repo Generator command will store the output.
78+
Copy the file [`model_config.yaml`](model_config.yaml) and create the directory to host the model store. This is where the Model Repo Generator command will store the output.
7979

8080
```
8181
mkdir model-store
@@ -88,7 +88,7 @@ chmod -R 777 model-store
8888
docker run --rm -it --gpus all -v $(pwd)/model-store:/model-store -v $(pwd)/model_config.yaml:/model_config.yaml -v $(pwd)/Llama-2-7b-chat-hf:/engine_dir nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.02.rc4 bash -c "model_repo_generator llm --verbose --yaml_config_file=/model_config.yaml"
8989
```
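After the generator finishes, the Triton model repository should be available under `model-store`; a quick sanity check (the exact layout may vary with the NIM version):

```
# List the first two levels of the generated model repository.
find model-store -maxdepth 2
```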
9090

91-
### Export the model repository to an Oracle Object Storage Bucket
91+
### Export the model repository to an Oracle Cloud Object Storage Bucket
9292

9393
At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded to NIM/llama2-7b-hf (in case we upload different model configurations to the same bucket):
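As a sketch, run from inside `model-store` and assuming the instance-principal authentication set up earlier:

```
cd model-store
# Bulk-upload the model repository to the NIM bucket under the llama2-7b-hf/ prefix.
oci os object bulk-upload -bn NIM --src-dir . --prefix llama2-7b-hf/ --auth instance_principal
```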
9494

@@ -102,9 +102,9 @@ oci os object bulk-upload -bn NIM --src-dir . --prefix llama2-7b-hf/ --auth inst
102102
At this stage, the model repository is uploaded to an OCI bucket. This is a good moment to test the setup.
103103

104104
> [!IMPORTANT]
105-
> Because the option parameter `--model-repository` is currently hardoded in the container, we cannot simply point to the Bucket when starting the parameter. One option would be to adapt the python script within the container but we would need sudo privilege. The other would be to mount the bucket as a file system on the machine directly. We chose the second method here with [rclone](https://rclone.org/). Make sure fuse3 is on the machine. On Ubuntu you can run `sudo apt install fuse3 jq`.
105+
> Because the `--model-repository` parameter is currently hardcoded in the container, we cannot simply point to the bucket when we start it. One option would be to adapt the Python script within the container, but that would require sudo privileges. The other is to mount the bucket directly as a file system on the machine. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu you can run `sudo apt install fuse3 jq`.
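As an illustration of where this is heading, the rclone remote can use rclone's `oracleobjectstorage` backend with instance-principal authentication; a minimal sketch, where the remote name `modelbucket` is a placeholder and the shell variables are gathered in the next step:

```
# Illustrative sketch only -- NAMESPACE, COMPARTMENT_OCID and REGION are gathered below.
mkdir -p ~/.config/rclone
cat <<EOF > ~/.config/rclone/rclone.conf
[modelbucket]
type = oracleobjectstorage
provider = instance_principal_auth
namespace = $NAMESPACE
compartment = $COMPARTMENT_OCID
region = $REGION
EOF
```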
106106
107-
Start by gathering your Namespace, Compartment OCID and region, either fetching them from the web console or by running the following commands from your compute instance:
107+
Start by gathering your Namespace, Compartment OCID and Region, either by fetching them from the web console or by running the following commands from your compute instance:
108108

109109
```
110110
#NAMESPACE:
@@ -159,9 +159,9 @@ curl -X "POST" 'http://localhost:9999/v1/completions' -H 'accept: application/js
159159
> [!NOTE]
160160
> Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using Docker, but because it needs the `--device` option to use `fuse`, it is complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a feature request from 2015 that is still very active as of March 2024). The workaround I chose is to set up rclone as a service on the host and mount the bucket on startup.
161161
162-
In ![cloud-init](cloud-init), replace the value of your namespace, compartment OCID and region lines 17, 18 and 19 with the values retrieved previously. You can also adapt the value of the bucket line 57. By default it is called `NIM` and has a directory called `llama2-7b-hf`.
162+
In [cloud-init](cloud-init), replace the values of your Namespace, Compartment OCID and Region on lines 17, 18 and 19 with the values retrieved previously. You can also adapt the bucket name on line 57. By default it is called `NIM` and has a directory called `llama2-7b-hf`.
163163

164-
This cloud-init script will be uploaded on your GPU node in your OKE cluster. The first part consists in increasing the boot volume to the value set. Then, it downloads rclone, creates the correct directories and create the configuration file, the same way as we did previously. Finally, it starts rclone as a service and mount the bucket to `/opt/mnt/model_bucket_oci`.
164+
This cloud-init script will be uploaded to your GPU node in your OKE cluster. The first part increases the boot volume to the value set. Then, it downloads rclone, creates the correct directories and creates the configuration file, the same way we did previously on the GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
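Once the GPU node is up, you can verify the mount directly on the node; an illustrative check, not part of the repository:

```
# Run on the GPU node -- the bucket should be mounted by the rclone service from cloud-init.
mount | grep model_bucket_oci
ls /opt/mnt/model_bucket_oci
```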
165165

166166
## Deploy on OKE
167167

@@ -176,7 +176,7 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)
176176
Start by creating an OKE Cluster following [this tutorial](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingclusterusingoke_topic-Using_the_Console_to_create_a_Quick_Cluster_with_Default_Settings.htm) with slight adaptations:
177177

178178
* Start by creating a CPU node pool with a single node that will be used for monitoring (e.g. VM.Standard.E4.Flex with 5 OCPUs and 80 GB RAM) with the default image.
179-
* Once your cluster is up, create another node pool with 1 GPU node (i.e VM.GPU.A10.1) with the default image coming with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified ![cloud-init script](cloud-init)
179+
* Once your cluster is up, create another node pool with 1 GPU node (e.g. VM.GPU.A10.1) with the default image that comes with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init).
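Once you have kubectl access to the cluster (set up in the next step), you can confirm that both node pools registered; an illustrative check:

```
# Lists the nodes with their shapes -- expect the CPU shape and the VM.GPU.A10.1 node.
kubectl get nodes -L node.kubernetes.io/instance-type
```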
180180

181181

182182
### Deploy using Helm in Cloud Shell
@@ -185,9 +185,9 @@ See [this documentation](https://docs.oracle.com/en-us/iaas/Content/API/Concepts
185185

186186
#### Adapting the variables
187187

188-
You can find the Helm configuration in */oke* where you need to adapt the *values.yaml*:
188+
You can find the Helm configuration in the folder [`oke`](oke/), where you need to adapt [`values.yaml`](oke/values.yaml):
189189

190-
Review your credentials for the [secret to pull the image](https://helm.sh/docs/howto/charts_tips_and_tricks/#creating-image-pull-secrets) in *values.yaml*:
190+
Review your credentials for the [secret to pull the image](https://helm.sh/docs/howto/charts_tips_and_tricks/#creating-image-pull-secrets) in [`values.yaml`](oke/values.yaml):
191191

192192
```
193193
registry: nvcr.io
@@ -208,7 +208,7 @@ helm install example-metrics --set prometheus.prometheusSpec.serviceMonitorSelec
208208

209209
The load balancer created by default comes with a fixed shape and a bandwidth of 100 Mbps. You can switch to a [flexible](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingloadbalancers-subtopic.htm#contengcreatingloadbalancers_subtopic) shape and adapt the bandwidth according to your OCI limits if bandwidth becomes a bottleneck.
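A hedged sketch of switching an existing LoadBalancer service to a flexible shape with the OCI annotations; the service name is a placeholder, and the min/max bandwidth (in Mbps) should match your limits:

```
# Illustrative only -- annotate the LoadBalancer service with a flexible shape.
kubectl annotate service <your-lb-service> \
  service.beta.kubernetes.io/oci-load-balancer-shape=flexible \
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-min=10 \
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-max=100 \
  --overwrite
```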
210210

211-
An example Grafana dashboard is available in *dashboard-review.json*. Use the import function in Grafana to import and view this dashboard.
211+
An example Grafana dashboard is available in [dashboard-review.json](oke/dashboard-review.json). Use the import function in Grafana to import and view this dashboard.
212212

213213
You can then see the public IP of your Grafana dashboard by running:
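As a sketch, assuming the `example-metrics` Helm release installed above (the Grafana service is typically named `<release>-grafana`):

```
# Print the external IP assigned to the Grafana LoadBalancer service.
kubectl get svc example-metrics-grafana -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```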
214214

@@ -302,7 +302,7 @@ Resources:
302302

303303
* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
304304
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
305-
* [NCG page with all version of NVIDIA Triton Inference Server](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags)
306305
* [NIM documentation on how to use non-prebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)
306+
* [NVIDIA TensorRT-LLM GitHub repository](https://github.com/NVIDIA/TensorRT-LLM)
307307

308308
