cloud-infrastructure/ai-infra-gpu/AI Infrastructure/nim-gpu-oke/README.md
22 additions & 22 deletions
@@ -11,7 +11,7 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
### Prerequisites

* You have access to an Oracle Cloud Tenancy.
- * You have access to shapes with NVIDIA GPU such as A10 GPU's (i.e VM.GPU.A10.1).
+ * You have access to shapes with NVIDIA GPU such as A10 GPUs (i.e. VM.GPU.A10.1).
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
* Ability for your instance to authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
@@ -20,13 +20,13 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
* You have a HuggingFace account with an Access Token configured to download `llama2-7B-chat`.

> [!IMPORTANT]
- > All the tests of this walkthrough have been realised with an early access version of NVIDIA NIM for LLM's with nemollm-inference-ms:24.02.rc4. NVIDIA NIM has for ambition to make deployment of LLM's easier compared to the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough takes back the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke) skipping the container creation.
+ > All the tests in this walkthrough were performed with an early access version of NVIDIA NIM for LLMs (nemollm-inference-ms:24.02.rc4). NVIDIA NIM aims to make the deployment of LLMs easier than the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough follows the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke), skipping the container creation.

# Docs

* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
- *[NIM documentation on how to use nonprebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)
+ * [NIM documentation on how to use non-prebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)

10. Export the model repository to an Oracle Cloud Object Storage Bucket

- At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded in NIM/llama2-7b-hf (in case we upload different model configuration to the same bucket):
+ At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded in NIM/llama2-7b-hf (in case we upload a different model configuration to the same bucket):

```bash
cd model-store
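# Hypothetical sketch of the bulk upload (the exact invocation is cut off in this diff);
# assumes the OCI CLI and instance principal auth are available on the VM:
oci os object bulk-upload --bucket-name NIM --src-dir . --object-prefix llama2-7b-hf/ --auth instance_principal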
@@ -107,9 +107,9 @@ Let's spin up a VM with a GPU!
At this stage, the model repository is uploaded to one OCI bucket. It is a good moment to try the setup.

> [!IMPORTANT]
- > Because the option parameter `--model-repository` is currently hardoded in the container, we cannot simply point to the Bucket when we start it. One option would be to adapt the python script within the container but we would need sudo privilege. The other would be to mount the bucket as a file system on the machine directly. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu you can run `sudo apt install fuse3 jq`.
+ > Because the `--model-repository` option is currently hardcoded in the container, we cannot simply point to the bucket when we start it. One option would be to adapt the Python script within the container, but we would need sudo privileges. The other would be to mount the bucket as a file system on the machine directly. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu, you can run `sudo apt install fuse3 jq`.

- Start by gathering your Namespace, Compartment OCID and Region, either fetching them from the web console or by running the following commands from your compute instance:
+ Start by gathering your Namespace, Compartment OCID, and Region, either by fetching them from the web console or by running the following commands from your compute instance:

```bash
#NAMESPACE:
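# Hedged sketch of the rest of this block (truncated in the diff); assumes the OCI CLI,
# jq, and instance principal auth are available on the instance:
oci os ns get --auth instance_principal | jq -r '.data'
#COMPARTMENT OCID and REGION (from the OCI instance metadata service, IMDSv2):
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.compartmentId'
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.canonicalRegionName'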
@@ -152,7 +152,7 @@ Let's spin up a VM with a GPU!
docker run --gpus all -p9999:9999 -p9998:9998 -v $HOME/test_directory/model_bucket_oci:/model-store nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.02.rc4 nemollm_inference_ms --model llama2-7b-chat --openai_port="9999" --nemo_port="9998" --num_gpus 1
```

- After 3 minutes, the inference server should be ready to serve. In another window you can run the following request. Note that if you want to run it from your local machine you will have to use the public IP and open the port 9999 at both the machine and subnet level:
+ After 3 minutes, the inference server should be ready to serve. In another window, you can run the following request. Note that if you want to run it from your local machine you will have to use the public IP and open port 9999 at both the machine and subnet levels:
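The request itself falls outside this hunk. As an illustration only, a completion request against the OpenAI-compatible port 9999 could look like the sketch below; the endpoint path, payload fields, and placeholder IP are assumptions rather than values taken from the README.

```bash
# Assumed OpenAI-style completions call against the port exposed with --openai_port=9999.
curl -X POST "http://<PUBLIC_IP>:9999/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2-7b-chat", "prompt": "What is Oracle Cloud?", "max_tokens": 100}'
```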
@@ -161,11 +161,11 @@ Let's spin up a VM with a GPU!
12. Adapt the cloud-init script

> [!NOTE]
- > Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using docker but because it needs the `--device` option to use `fuse`, this makes it complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a Feature Request from 2015 still very active as of March 2024). The workaround I chose is to setup rclone as a service on the host and mount the bucket on startup.
+ > Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using Docker, but because it needs the `--device` option to use `fuse`, it is complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a feature request from 2015 that is still very active as of March 2024). The workaround I chose is to set up rclone as a service on the host and mount the bucket on startup.

- In [cloud-init](cloud-init), replace the value of your Namespace, Compartment OCID and Region lines 17, 18 and 19 with the values retrieved previously. You can also adapt the value of the bucket line 57. By default it is called `NIM` and has a directory called `llama2-7b-hf`.
+ In [cloud-init](cloud-init), replace the values of your Namespace, Compartment OCID, and Region on lines 17, 18, and 19 with the values retrieved previously. You can also adapt the name of the bucket on line 57. By default, it is called `NIM` and has a directory called `llama2-7b-hf`.

- This cloud-init script will be uploaded on your GPU node in your OKE cluster. The first part consists in increasing the boot volume to the value set. Then, it downloads rclone, creates the correct directories and create the configuration file, the same way as we did previously on teh GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
+ This cloud-init script will be uploaded to your GPU node in your OKE cluster. The first part increases the boot volume to the value set. Then, it downloads rclone, creates the correct directories, and creates the configuration file, the same way as we did previously on the GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
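For orientation only, the kind of rclone configuration and mount that such a cloud-init script could set up is sketched below; the remote name, the `oracleobjectstorage` backend keys, and the placeholder values are assumptions, not an excerpt of the actual [cloud-init](cloud-init) file.

```bash
# Assumed rclone remote for OCI Object Storage using instance principal auth.
mkdir -p /root/.config/rclone /opt/mnt/model_bucket_oci
cat <<EOF > /root/.config/rclone/rclone.conf
[model_bucket_oci]
type = oracleobjectstorage
provider = instance_principal_auth
namespace = <NAMESPACE>
compartment = <COMPARTMENT_OCID>
region = <REGION>
EOF

# Mount the "NIM" bucket where the inference container expects its model store.
# --allow-other requires user_allow_other to be enabled in /etc/fuse.conf.
rclone mount model_bucket_oci:NIM /opt/mnt/model_bucket_oci --allow-other --daemon
```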
## 2. Deploy on OKE
@@ -179,8 +179,8 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
Start by creating an OKE Cluster following [this tutorial](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingclusterusingoke_topic-Using_the_Console_to_create_a_Quick_Cluster_with_Default_Settings.htm) with slight adaptations:

- * Create 1 CPU node pool that will be used for monitoring with 1 node only (i.e VM.Standard.E4.Flex with 5 OCPU and 80GB RAM) with the default image.
- * Once your cluster is up, create another node pool with 1 GPU node (i.e VM.GPU.A10.1) with the default image coming with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init)
+ * Create 1 CPU node pool that will be used for monitoring with 1 node only (i.e. VM.Standard.E4.Flex with 5 OCPU and 80GB RAM) with the default image.
+ * Once your cluster is up, create another node pool with 1 GPU node (i.e. VM.GPU.A10.1) with the default image coming with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init).
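Once both node pools are active, a quick sanity check can confirm that the GPU node registered and advertises a GPU. This is a sketch; it assumes your kubeconfig already points at the new cluster and that the NVIDIA device plugin exposes the `nvidia.com/gpu` resource.

```bash
# List the nodes, then print each node's allocatable GPU count.
kubectl get nodes -o wide
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```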
2. Deploy using Helm in Cloud Shell
@@ -201,9 +201,9 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
## 3. Deploy monitoring (Grafana & Prometheus)

- The monitoring consist of Grafana and Prometheus pods. The configuration comes from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
+ The monitoring consists of Grafana and Prometheus pods. The configuration comes from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

- Here we add a public Load Balancer to reach the grafana dashboard from the Internet. Use username=admin and password=prom-operator to login. The *serviceMonitorSelectorNilUsesHelmValues* flag is needed so that Prometheus can find the inference server metrics in the example release deployed below.
+ Here we add a public Load Balancer to reach the Grafana dashboard from the Internet. Use username=admin and password=prom-operator to log in. The *serviceMonitorSelectorNilUsesHelmValues* flag is needed so that Prometheus can find the inference server metrics in the example release deployed below.
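The exact Helm command is outside this hunk. A minimal sketch, assuming the upstream prometheus-community chart repository and the two settings discussed above (release and namespace names are arbitrary):

```bash
# Install kube-prometheus-stack with a public LoadBalancer for Grafana and let Prometheus
# select ServiceMonitors outside this Helm release.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.service.type=LoadBalancer
```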
@@ -213,7 +213,7 @@ The default load balancer created comes with a fixed shape and a bandwidth of 10
An example Grafana dashboard is available in [dashboard-review.json](oke/dashboard-review.json). Use the import function in Grafana to import and view this dashboard.

- You can then see the Public IP of you grafana dashboard by running:
+ You can then see the Public IP of your Grafana dashboard by running:

```bash
$ kubectl get svc
@@ -231,7 +231,7 @@ cd <directory containing Chart.yaml>
helm install example . -f values.yaml --debug
```

- Use kubectl to see the status and wait until the inference server pods are running. The first pull might take a few minutes. Once the container is created, loading the model also take a few minutes. You can monitor the pod with:
+ Use kubectl to see the status and wait until the inference server pods are running. The first pull might take a few minutes. Once the container is created, loading the model also takes a few minutes. You can monitor the pod with:
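The monitoring command itself is cut off by this hunk; a typical way to watch the rollout (the pod name is a placeholder) is:

```bash
# Watch the pods until the inference server reports Running, then follow its logs.
kubectl get pods -w
kubectl logs -f <inference-server-pod-name>
```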

- ## 5. Using *Triton* Inference Server on you NIM container
+ ## 5. Using *Triton* Inference Server on your NIM container

- Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case it is 34.83.9.133.
+ Now that the inference server is running, you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case, it is 34.83.9.133.
```bash
$ kubectl get services
@@ -257,7 +257,7 @@ NAME TYPE CLUSTER-IP EXTERNAL-IP POR
- The inference server exposes an HTTP endpoint on port 8000, and GRPC endpoint on port 8001 and a Prometheus metrics endpoint on port 8002. You can use curl to get the meta-data of the inference server from the HTTP endpoint.
+ The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the metadata of the inference server from the HTTP endpoint.
```bash
$ curl 34.83.9.133:8000/v2
@@ -269,15 +269,15 @@ From your client machine, you can now send a request to the public IP on port 80
"\n\nOracle Cloud is a comprehensive cloud computing platform offered by Oracle Corporation. It provides a wide range of cloud services, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Oracle Cloud offers a variety of benefits, including:\n\n1. Scalability: Oracle Cloud allows customers to scale their resources up or down as needed, providing the flexibility to handle changes in business demand."
```
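The request that produced the completion above is cut off by this hunk. As a hedged sketch against the public Load Balancer on port 80, it might resemble the following; the path and payload fields are assumptions about the OpenAI-compatible API, not values from the README.

```bash
# Hypothetical completion request through the public LoadBalancer IP; adjust the path and fields to the real API.
curl -X POST "http://<LOAD_BALANCER_PUBLIC_IP>/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2-7b-chat", "prompt": "What is Oracle Cloud?", "max_tokens": 200}'
```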
## Cleaning up

- Once you've finished using the inference server you should use helm to delete the deployment.
+ Once you've finished using the inference server, you should use Helm to delete the deployment.
```bash
$ helm list
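# The rest of this block is truncated in the diff; the release installed earlier
# ("example") would presumably be removed with:
helm uninstall example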
@@ -304,7 +304,7 @@ oci os bucket delete --bucket-name NIM --empty

## Contributing

- This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the opensource community.
+ This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the open-source community.