
Commit 21e7e0e: Update README.md

spell checking

Parent: e2b6543

1 file changed: 22 additions, 22 deletions

cloud-infrastructure/ai-infra-gpu/AI Infrastructure/nim-gpu-oke/README.md
@@ -11,7 +11,7 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
### Prerequisites

* You have access to an Oracle Cloud Tenancy.
* You have access to shapes with an NVIDIA GPU, such as the A10 (e.g., VM.GPU.A10.1).
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
* Your instance can authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
@@ -20,13 +20,13 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
* You have a HuggingFace account with an Access Token configured to download `llama2-7B-chat`.

> [!IMPORTANT]
> All the tests in this walkthrough were run with an early-access version of NVIDIA NIM for LLMs, nemollm-inference-ms:24.02.rc4. NVIDIA NIM aims to make deploying LLMs easier than the previous implementation with Triton and TRT-LLM. This walkthrough therefore follows the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke), skipping the container creation.

# Docs

* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
* [NIM documentation on how to use non-prebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)
* [NVIDIA TensorRT-LLM GitHub repository](https://github.com/NVIDIA/TensorRT-LLM)

## 1. Instance Creation
@@ -95,7 +95,7 @@ Let's spin up a VM with a GPU!
10. Export the model repository to an Oracle Cloud Object Storage Bucket

At this stage, the model repository is located in the `model-store` directory. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM", where we want the model store uploaded to NIM/llama2-7b-hf (in case we upload different model configurations to the same bucket):

```bash
cd model-store
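# A hedged sketch of the bulk upload step, assuming the standard `oci os object bulk-upload`
# command; the bucket name and prefix follow the example above. Verify the flags against
# your oci-cli version before running.
oci os object bulk-upload --bucket-name NIM --object-prefix llama2-7b-hf/ --src-dir .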
@@ -107,9 +107,9 @@ Let's spin up a VM with a GPU!
At this stage, the model repository is uploaded to an OCI bucket, so it is a good moment to test the setup.

> [!IMPORTANT]
> Because the `--model-repository` option is currently hardcoded in the container, we cannot simply point to the bucket when starting it. One option would be to adapt the Python script inside the container, but that requires sudo privileges. The other is to mount the bucket directly as a file system on the machine. We chose the second method, using [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine; on Ubuntu, you can run `sudo apt install fuse3 jq`.
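As a rough illustration of that second method (a minimal sketch, not the walkthrough's exact steps: the remote name `oos`, the config path, and the mount flags are assumptions, and the placeholder values are gathered just below), the bucket can be mounted roughly like this:

```bash
# Hypothetical rclone remote for OCI Object Storage using instance principal auth.
mkdir -p ~/.config/rclone
cat <<'EOF' > ~/.config/rclone/rclone.conf
[oos]
type = oracleobjectstorage
provider = instance_principal_auth
namespace = <NAMESPACE>
compartment = <COMPARTMENT_OCID>
region = <REGION>
EOF

# Mount the "NIM" bucket where the docker run command below expects the model store.
mkdir -p $HOME/test_directory/model_bucket_oci
rclone mount oos:NIM $HOME/test_directory/model_bucket_oci --daemon --vfs-cache-mode writes
```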
Start by gathering your Namespace, Compartment OCID, and Region, either from the web console or by running the following commands from your compute instance:

```bash
#NAMESPACE:
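# A hedged sketch, assuming the standard oci-cli and the OCI instance metadata (IMDS v2) endpoint:
oci os ns get
#COMPARTMENT_OCID and REGION:
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.compartmentId, .canonicalRegionName'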
@@ -152,7 +152,7 @@ Let's spin up a VM with a GPU!
docker run --gpus all -p9999:9999 -p9998:9998 -v $HOME/test_directory/model_bucket_oci:/model-store nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.02.rc4 nemollm_inference_ms --model llama2-7b-chat --openai_port="9999" --nemo_port="9998" --num_gpus 1
```

After about 3 minutes, the inference server should be ready to serve. In another window, you can run the following request. Note that if you want to run it from your local machine, you will have to use the public IP and open port 9999 at both the machine and subnet levels:
```bash
curl -X "POST" 'http://localhost:9999/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "llama2-7b-chat", "prompt": "Can you briefly describe Oracle Cloud?", "max_tokens": 100, "temperature": 0.7, "n": 1, "stream": false, "stop": "string", "frequency_penalty": 0.0 }' | jq ".choices[0].text"
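# (Hedged sketch) When calling from outside the VM, open the port on the host firewall and
# add a matching stateful ingress rule for TCP 9999 to the subnet's security list in the console.
sudo iptables -I INPUT -p tcp --dport 9999 -j ACCEPT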
@@ -161,11 +161,11 @@ Let's spin up a VM with a GPU!
12. Adapt the cloud-init script

> [!NOTE]
> Ideally, a cleaner way of using rclone in Kubernetes would be to run the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally with Docker, but because it needs the `--device` option to use `fuse`, it is complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a feature request from 2015 that is still very active as of March 2024). The workaround I chose is to set up rclone as a service on the host and mount the bucket on startup.

In [cloud-init](cloud-init), replace the values of your Namespace, Compartment OCID, and Region on lines 17, 18, and 19 with the values retrieved previously. You can also adapt the bucket name on line 57. By default, it is called `NIM` and has a directory called `llama2-7b-hf`.

This cloud-init script will be uploaded to your GPU node in your OKE cluster. The first part increases the boot volume to the configured value. Then it downloads rclone, creates the required directories, and creates the configuration file, the same way we did previously on the GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
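For a mental model of that last step, the service set up by the cloud-init script can be pictured roughly as follows (a minimal sketch only: the unit name, the remote name `oos`, and the mount options are assumptions, and the authoritative definitions live in [cloud-init](cloud-init)):

```bash
# Hypothetical systemd unit that mounts the model bucket with rclone at boot.
sudo tee /etc/systemd/system/rclone-model-bucket.service > /dev/null <<'EOF'
[Unit]
Description=Mount the NIM model bucket with rclone
After=network-online.target

[Service]
Type=simple
ExecStartPre=/usr/bin/mkdir -p /opt/mnt/model_bucket_oci
# --allow-other requires user_allow_other to be enabled in /etc/fuse.conf.
ExecStart=/usr/bin/rclone mount oos:NIM /opt/mnt/model_bucket_oci --allow-other --vfs-cache-mode writes
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now rclone-model-bucket.service
```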
## 2. Deploy on OKE
@@ -179,8 +179,8 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
Start by creating an OKE cluster following [this tutorial](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingclusterusingoke_topic-Using_the_Console_to_create_a_Quick_Cluster_with_Default_Settings.htm) with slight adaptations:

* Create one CPU node pool with a single node that will be used for monitoring (e.g., VM.Standard.E4.Flex with 5 OCPUs and 80 GB of RAM), using the default image.
* Once your cluster is up, create another node pool with one GPU node (e.g., VM.GPU.A10.1), using the default image that comes with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init); a quick sanity check follows this list.
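Once both node pools are Active, a quick sanity check that the GPU node registered correctly (a hedged sketch; the node name is a placeholder):

```bash
# List the nodes and confirm the GPU node is Ready.
kubectl get nodes -o wide
# The A10 node should advertise an allocatable nvidia.com/gpu resource.
kubectl describe node <GPU_NODE_NAME> | grep -i "nvidia.com/gpu"
```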
2. Deploy using Helm in Cloud Shell
@@ -201,9 +201,9 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
## 3. Deploy monitoring (Grafana & Prometheus)

The monitoring stack consists of Grafana and Prometheus pods. The configuration comes from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

Here we add a public Load Balancer to reach the Grafana dashboard from the Internet. Use username=admin and password=prom-operator to log in. The *serviceMonitorSelectorNilUsesHelmValues* flag is needed so that Prometheus can find the inference server metrics in the example release deployed below.
```bash
helm install example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false --set grafana.service.type=LoadBalancer prometheus-community/kube-prometheus-stack --debug
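# (Hedged) kube-prometheus-stack labels its pods with the release name; watch them come up:
kubectl get pods -l "release=example-metrics" -w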
@@ -213,7 +213,7 @@ The default load balancer created comes with a fixed shape and a bandwidth of 10
An example Grafana dashboard is available in [dashboard-review.json](oke/dashboard-review.json). Use the import function in Grafana to import and view this dashboard.

You can then see the public IP of your Grafana dashboard by running:

```bash
$ kubectl get svc
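# (Hedged; assumes the release is named example-metrics, giving a service called example-metrics-grafana.)
# Print only the load balancer IP:
kubectl get svc example-metrics-grafana -o jsonpath='{.status.loadBalancer.ingress[0].ip}'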
@@ -231,7 +231,7 @@ cd <directory containing Chart.yaml>
helm install example . -f values.yaml --debug
```

Use kubectl to see the status and wait until the inference server pods are running. The first pull might take a few minutes. Once the container is created, loading the model also takes a few minutes. You can monitor the pod with:

```bash
kubectl describe pods <POD_NAME>
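# Watch the pod work through ContainerCreating to Running, then follow the logs while the model loads:
kubectl get pods -w
kubectl logs -f <POD_NAME>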
@@ -246,9 +246,9 @@ NAME READY STATUS RESTARTS
example-triton-inference-server-5f74b55885-n6lt7 1/1 Running 0 2m21s
```

## 5. Using *Triton* Inference Server on your NIM container

Now that the inference server is running, you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case, it is 34.83.9.133.

```bash
$ kubectl get services
@@ -257,7 +257,7 @@ NAME TYPE CLUSTER-IP EXTERNAL-IP POR
example-triton-inference-server LoadBalancer 10.18.13.28 34.83.9.133 8000:30249/TCP,8001:30068/TCP,8002:32723/TCP 47m
```

The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the metadata of the inference server from the HTTP endpoint.

```bash
$ curl 34.83.9.133:8000/v2
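# (Hedged) Triton's standard HTTP API also exposes readiness and Prometheus metrics endpoints:
$ curl 34.83.9.133:8000/v2/health/ready
$ curl 34.83.9.133:8002/metrics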
@@ -269,15 +269,15 @@ From your client machine, you can now send a request to the public IP on port 80
curl -X "POST" 'http://34.83.9.133:9999/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "llama2-7b-chat", "prompt": "Can you briefly describe Oracle Cloud?", "max_tokens": 100, "temperature": 0.7, "n": 1, "stream": false, "stop": "string", "frequency_penalty": 0.0 }' | jq ".choices[0].text"
```

The output should be as follows:

```bash
"\n\nOracle Cloud is a comprehensive cloud computing platform offered by Oracle Corporation. It provides a wide range of cloud services, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Oracle Cloud offers a variety of benefits, including:\n\n1. Scalability: Oracle Cloud allows customers to scale their resources up or down as needed, providing the flexibility to handle changes in business demand."
```

## Cleaning up

Once you've finished using the inference server, you should use Helm to delete the deployment.

```bash
$ helm list
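# (Hedged) Remove the releases created earlier; the names assume the install commands used above.
$ helm uninstall example
$ helm uninstall example-metrics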
@@ -304,7 +304,7 @@ oci os bucket delete --bucket-name NIM --empty
## Contributing

This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the open-source community.

## License