cloud-infrastructure/ai-infra-gpu/AI Infrastructure/nim-gpu-oke/README.md
22 additions & 22 deletions
@@ -11,7 +11,7 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
### Prerequisites

* You have access to an Oracle Cloud Tenancy.
- * You have access to shapes with NVIDIA GPU such as A10 GPU's (i.e VM.GPU.A10.1).
+ * You have access to shapes with NVIDIA GPU such as A10 GPUs (i.e. VM.GPU.A10.1).
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
* Ability for your instance to authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
@@ -20,13 +20,13 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
* You have a HuggingFace account with an Access Token configured to download `llama2-7B-chat`.

> [!IMPORTANT]
- > All the tests of this walkthrough have been realised with an early access version of NVIDIA NIM for LLM's with nemollm-inference-ms:24.02.rc4. NVIDIA NIM has for ambition to make deployment of LLM's easier compared to the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough takes back the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke) skipping the container creation.
+ > All the tests in this walkthrough were performed with an early access version of NVIDIA NIM for LLMs (nemollm-inference-ms:24.02.rc4). NVIDIA NIM aims to make the deployment of LLMs easier than the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough follows the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke), skipping the container creation.

# Docs

* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
- *[NIM documentation on how to use nonprebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)
+ * [NIM documentation on how to use non-prebuilt models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html)

10. Export the model repository to an Oracle Cloud Object Storage Bucket

- At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded in NIM/llama2-7b-hf (in case we upload different model configuration to the same bucket):
+ At this stage, the model repository is located in the directory `model-store`. You can use `oci-cli` to do a bulk upload to one of your buckets in the region. Here is an example for a bucket called "NIM" where we want the model store to be uploaded in NIM/llama2-7b-hf (in case we upload a different model configuration to the same bucket):

```bash
cd model-store
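# Hypothetical sketch of the bulk upload (the exact invocation is cut off in this diff);
# assumes the OCI CLI and instance principal auth are available on the VM:
oci os object bulk-upload --bucket-name NIM --src-dir . --object-prefix llama2-7b-hf/ --auth instance_principal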
@@ -107,9 +107,9 @@ Let's spin up a VM with a GPU!
At this stage, the model repository is uploaded to one OCI bucket. It is a good moment to try the setup.

> [!IMPORTANT]
- > Because the option parameter `--model-repository` is currently hardoded in the container, we cannot simply point to the Bucket when we start it. One option would be to adapt the python script within the container but we would need sudo privilege. The other would be to mount the bucket as a file system on the machine directly. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu you can run `sudo apt install fuse3 jq`.
+ > Because the `--model-repository` option is currently hardcoded in the container, we cannot simply point to the bucket when we start it. One option would be to adapt the Python script within the container, but we would need sudo privileges. The other would be to mount the bucket as a file system on the machine directly. We chose the second method with [rclone](https://rclone.org/). Make sure fuse3 and jq are installed on the machine. On Ubuntu, you can run `sudo apt install fuse3 jq`.

- Start by gathering your Namespace, Compartment OCID and Region, either fetching them from the web console or by running the following commands from your compute instance:
+ Start by gathering your Namespace, Compartment OCID, and Region, either by fetching them from the web console or by running the following commands from your compute instance:

```bash
#NAMESPACE:
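# Hedged sketch of the rest of this block (truncated in the diff); assumes the OCI CLI,
# jq, and instance principal auth are available on the instance:
oci os ns get --auth instance_principal | jq -r '.data'
#COMPARTMENT OCID and REGION (from the OCI instance metadata service, IMDSv2):
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.compartmentId'
curl -s -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.canonicalRegionName'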
@@ -152,7 +152,7 @@ Let's spin up a VM with a GPU!
docker run --gpus all -p9999:9999 -p9998:9998 -v $HOME/test_directory/model_bucket_oci:/model-store nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.02.rc4 nemollm_inference_ms --model llama2-7b-chat --openai_port="9999" --nemo_port="9998" --num_gpus 1
```

- After 3 minutes, the inference server should be ready to serve. In another window you can run the following request. Note that if you want to run it from your local machine you will have to use the public IP and open the port 9999 at both the machine and subnet level:
+ After 3 minutes, the inference server should be ready to serve. In another window, you can run the following request. Note that if you want to run it from your local machine you will have to use the public IP and open port 9999 at both the machine and subnet levels:
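The request itself falls outside this hunk. As an illustration only, a completion request against the OpenAI-compatible port 9999 could look like the sketch below; the endpoint path, payload fields, and placeholder IP are assumptions rather than values taken from the README.

```bash
# Assumed OpenAI-style completions call against the port exposed with --openai_port=9999.
curl -X POST "http://<PUBLIC_IP>:9999/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2-7b-chat", "prompt": "What is Oracle Cloud?", "max_tokens": 100}'
```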
@@ -161,11 +161,11 @@ Let's spin up a VM with a GPU!
12. Adapt the cloud-init script

> [!NOTE]
- > Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using docker but because it needs the `--device` option to use `fuse`, this makes it complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a Feature Request from 2015 still very active as of March 2024). The workaround I chose is to setup rclone as a service on the host and mount the bucket on startup.
+ > Ideally, a cleaner way of using rclone in Kubernetes would be to use the [rclone container](https://hub.docker.com/r/rclone/rclone) as a sidecar before starting the inference server. This works fine locally using Docker, but because it needs the `--device` option to use `fuse`, it is complicated to use with Kubernetes due to the lack of support for this feature (see https://github.com/kubernetes/kubernetes/issues/7890?ref=karlstoney.com, a feature request from 2015 that is still very active as of March 2024). The workaround I chose is to set up rclone as a service on the host and mount the bucket on startup.

- In [cloud-init](cloud-init), replace the value of your Namespace, Compartment OCID and Region lines 17, 18 and 19 with the values retrieved previously. You can also adapt the value of the bucket line 57. By default it is called `NIM` and has a directory called `llama2-7b-hf`.
+ In [cloud-init](cloud-init), replace the values of your Namespace, Compartment OCID, and Region on lines 17, 18, and 19 with the values retrieved previously. You can also adapt the name of the bucket on line 57. By default, it is called `NIM` and has a directory called `llama2-7b-hf`.

- This cloud-init script will be uploaded on your GPU node in your OKE cluster. The first part consists in increasing the boot volume to the value set. Then, it downloads rclone, creates the correct directories and create the configuration file, the same way as we did previously on teh GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
+ This cloud-init script will be uploaded to your GPU node in your OKE cluster. The first part increases the boot volume to the value set. Then, it downloads rclone, creates the correct directories, and creates the configuration file, the same way as we did previously on the GPU VM. Finally, it starts rclone as a service and mounts the bucket to `/opt/mnt/model_bucket_oci`.
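For orientation only, the kind of rclone configuration and mount that such a cloud-init script could set up is sketched below; the remote name, the `oracleobjectstorage` backend keys, and the placeholder values are assumptions, not an excerpt of the actual [cloud-init](cloud-init) file.

```bash
# Assumed rclone remote for OCI Object Storage using instance principal auth.
mkdir -p /root/.config/rclone /opt/mnt/model_bucket_oci
cat <<EOF > /root/.config/rclone/rclone.conf
[model_bucket_oci]
type = oracleobjectstorage
provider = instance_principal_auth
namespace = <NAMESPACE>
compartment = <COMPARTMENT_OCID>
region = <REGION>
EOF

# Mount the "NIM" bucket where the inference container expects its model store.
# --allow-other requires user_allow_other to be enabled in /etc/fuse.conf.
rclone mount model_bucket_oci:NIM /opt/mnt/model_bucket_oci --allow-other --daemon
```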
## 2. Deploy on OKE
@@ -179,8 +179,8 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
Start by creating an OKE Cluster following [this tutorial](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingclusterusingoke_topic-Using_the_Console_to_create_a_Quick_Cluster_with_Default_Settings.htm) with slight adaptations:

- * Create 1 CPU node pool that will be used for monitoring with 1 node only (i.e VM.Standard.E4.Flex with 5 OCPU and 80GB RAM) with the default image.
- * Once your cluster is up, create another node pool with 1 GPU node (i.e VM.GPU.A10.1) with the default image coming with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init)
+ * Create 1 CPU node pool that will be used for monitoring with 1 node only (i.e. VM.Standard.E4.Flex with 5 OCPU and 80GB RAM) with the default image.
+ * Once your cluster is up, create another node pool with 1 GPU node (i.e. VM.GPU.A10.1) with the default image coming with the GPU drivers. __*Important note*__: Make sure to increase the boot volume (350 GB) and add the previously modified [cloud-init script](cloud-init).
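Once both node pools are active, a quick sanity check can confirm that the GPU node registered and advertises a GPU. This is a sketch; it assumes your kubeconfig already points at the new cluster and that the NVIDIA device plugin exposes the `nvidia.com/gpu` resource.

```bash
# List the nodes, then print each node's allocatable GPU count.
kubectl get nodes -o wide
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```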
2. Deploy using Helm in Cloud Shell
@@ -201,9 +201,9 @@ It is now time to bring everything together in Oracle Kubernetes Engines (OKE)!
## 3. Deploy monitoring (Grafana & Prometheus)

- The monitoring consist of Grafana and Prometheus pods. The configuration comes from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
+ The monitoring consists of Grafana and Prometheus pods. The configuration comes from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

- Here we add a public Load Balancer to reach the grafana dashboard from the Internet. Use username=admin and password=prom-operator to login. The *serviceMonitorSelectorNilUsesHelmValues* flag is needed so that Prometheus can find the inference server metrics in the example release deployed below.
+ Here we add a public Load Balancer to reach the Grafana dashboard from the Internet. Use username=admin and password=prom-operator to log in. The *serviceMonitorSelectorNilUsesHelmValues* flag is needed so that Prometheus can find the inference server metrics in the example release deployed below.
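The exact Helm command is outside this hunk. A minimal sketch, assuming the upstream prometheus-community chart repository and the two settings discussed above (release and namespace names are arbitrary):

```bash
# Install kube-prometheus-stack with a public LoadBalancer for Grafana and let Prometheus
# select ServiceMonitors outside this Helm release.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.service.type=LoadBalancer
```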
@@ -213,7 +213,7 @@ The default load balancer created comes with a fixed shape and a bandwidth of 10
An example Grafana dashboard is available in [dashboard-review.json](oke/dashboard-review.json). Use the import function in Grafana to import and view this dashboard.

- You can then see the Public IP of you grafana dashboard by running:
+ You can then see the Public IP of your Grafana dashboard by running:

```bash
$ kubectl get svc
@@ -231,7 +231,7 @@ cd <directory containing Chart.yaml>
helm install example . -f values.yaml --debug
```

- Use kubectl to see the status and wait until the inference server pods are running. The first pull might take a few minutes. Once the container is created, loading the model also take a few minutes. You can monitor the pod with:
+ Use kubectl to see the status and wait until the inference server pods are running. The first pull might take a few minutes. Once the container is created, loading the model also takes a few minutes. You can monitor the pod with:
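The monitoring command itself is cut off by this hunk; a typical way to watch the rollout (the pod name is a placeholder) is:

```bash
# Watch the pods until the inference server reports Running, then follow its logs.
kubectl get pods -w
kubectl logs -f <inference-server-pod-name>
```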

- ## 5. Using *Triton* Inference Server on you NIM container
+ ## 5. Using *Triton* Inference Server on your NIM container

- Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case it is 34.83.9.133.
+ Now that the inference server is running, you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case, it is 34.83.9.133.
```bash
$ kubectl get services
@@ -257,7 +257,7 @@ NAME TYPE CLUSTER-IP EXTERNAL-IP POR
- The inference server exposes an HTTP endpoint on port 8000, and GRPC endpoint on port 8001 and a Prometheus metrics endpoint on port 8002. You can use curl to get the meta-data of the inference server from the HTTP endpoint.
+ The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the metadata of the inference server from the HTTP endpoint.
```bash
$ curl 34.83.9.133:8000/v2
@@ -269,15 +269,15 @@ From your client machine, you can now send a request to the public IP on port 80
"\n\nOracle Cloud is a comprehensive cloud computing platform offered by Oracle Corporation. It provides a wide range of cloud services, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Oracle Cloud offers a variety of benefits, including:\n\n1. Scalability: Oracle Cloud allows customers to scale their resources up or down as needed, providing the flexibility to handle changes in business demand."
```
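The request that produced the completion above is cut off by this hunk. As a hedged sketch against the public Load Balancer on port 80, it might resemble the following; the path and payload fields are assumptions about the OpenAI-compatible API, not values from the README.

```bash
# Hypothetical completion request through the public LoadBalancer IP; adjust the path and fields to the real API.
curl -X POST "http://<LOAD_BALANCER_PUBLIC_IP>/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2-7b-chat", "prompt": "What is Oracle Cloud?", "max_tokens": 200}'
```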
## Cleaning up

- Once you've finished using the inference server you should use helm to delete the deployment.
+ Once you've finished using the inference server, you should use Helm to delete the deployment.
```bash
$ helm list
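# The rest of this block is truncated in the diff; the release installed earlier
# ("example") would presumably be removed with:
helm uninstall example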
@@ -304,7 +304,7 @@ oci os bucket delete --bucket-name NIM --empty

## Contributing

- This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the opensource community.
+ This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the open-source community.