
Commit fe5c517

Merge pull request #1044 from oracle-devrel/rag-marketing-update
Rag marketing update
2 parents f123540 + 94cc0ab commit fe5c517

4 files changed (+92 / -73 lines)


cloud-infrastructure/ai-infra-gpu/AI Infrastructure/nim-gpu-oke/README.md

Lines changed: 18 additions & 12 deletions
@@ -4,7 +4,9 @@
 
 ## Introduction
 
-This repository intends to demonstrate how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Kubernetes Engine (OKE) with TensorRT-LLM Backend and Triton Inference Server in order to serve Large Language Models (LLM's) in a Kubernetes architecture. The model used is Llama2-7B-chat on a GPU A10. For scalability, we are hosting the model repository on a Bucket in Oracle Cloud Object Storage.
+This repository demonstrates how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Container Engine for Kubernetes (OKE) with the TensorRT-LLM backend and Triton Inference Server in order to serve Large Language Models (LLMs) in a Kubernetes architecture.
+
+The model used is `Llama2-7B-chat`, running on an NVIDIA A10 Tensor Core GPU hosted on OCI. For scalability, we are hosting the model repository in a bucket in Oracle Cloud Object Storage.
 
 ## 0. Prerequisites & Docs
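
Note (illustrative): the new introduction mentions hosting the model repository in an Oracle Cloud Object Storage bucket. A minimal sketch of that step with the OCI CLI is shown below; the bucket name, compartment OCID, and local `model_repository` directory are placeholder values, not taken from the repository.

```bash
# Minimal sketch: host the model repository in OCI Object Storage.
# Assumes the OCI CLI is configured; bucket name, compartment OCID, and the
# local model_repository/ path are illustrative placeholders.

# Create a bucket for the model repository.
oci os bucket create \
  --name nim-model-repository \
  --compartment-id ocid1.compartment.oc1..example

# Upload the locally built model repository into the bucket.
oci os object bulk-upload \
  --bucket-name nim-model-repository \
  --src-dir ./model_repository \
  --overwrite
```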

@@ -15,14 +17,18 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
 * You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
 * You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry.
 * Ability for your instance to authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm)
-* You have access to NVIDIA AI Entreprise to pull the NIM containers.
-* You are familiar with Kubernetes and Helm basic terminology.
-* You have a HuggingFace account with an Access Token configured to download `llama2-7B-chat`
+* You have access to **NVIDIA AI Enterprise** to pull the NIM containers.
+* You are familiar with *Kubernetes* and *Helm* basic terminology.
+* You have a HuggingFace account with an Access Token configured to download `Llama2-7B-chat`.
 
 > [!IMPORTANT]
+<<<<<<< HEAD
+> All tests of this walkthrough have been performed with an early access version of NVIDIA NIM for LLMs, **nemollm-inference-ms:24.02.rc4**. NVIDIA NIM aims to make the deployment of LLMs easier compared to the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough assumes you've previously completed [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke). This is a continuation of that guide.
+=======
 > All the tests of this walkthrough have been realized with an early access version of NVIDIA NIM for LLM's with nemollm-inference-ms:24.02.rc4. NVIDIA NIM has for ambition to make the deployment of LLMs easier compared to the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough takes back the steps of [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke) skipping the container creation.
+>>>>>>> b671e8e9a1b1a9bce4222883c7cc87ba020d5f63
 
-# Docs
+### Docs
 
 * [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
 * [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
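
Note (illustrative): for the prerequisites above, a hedged sketch of the corresponding checks follows; the registry region key, tenancy namespace, and username are placeholders, and `huggingface-cli` is assumed to be installed.

```bash
# Placeholder values throughout; adjust to your tenancy before running.

# Log in to your OCI container registry (OCIR), using your Auth Token as the password.
docker login <region-key>.ocir.io -u '<tenancy-namespace>/<username>'

# Confirm the instance can authenticate via instance principal.
oci os ns get --auth instance_principal

# Authenticate to HuggingFace so Llama2-7B-chat can be downloaded later.
huggingface-cli login
```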
@@ -31,11 +37,11 @@ This repository intends to demonstrate how to deploy [NVIDIA NIM](https://develo
 
 ## 1. Instance Creation
 
-Let's spin up a VM with a GPU!
+Let's spin up a GPU VM instance on OCI!
 
-1. Start a VM.GPU.A10.1 from the `Instance > Compute` menu with the [NGC image](https://docs.oracle.com/en-us/iaas/Content/Compute/References/ngcimage.htm). A boot volume of 200-250 GB is also recommended.
+1. Start a **VM.GPU.A10.1** instance from the `Instance > Compute` menu with the [NGC image](https://docs.oracle.com/en-us/iaas/Content/Compute/References/ngcimage.htm). A boot volume of **200-250 GB** is also recommended.
 
-2. It is recommended to update your drivers to the latest available following the guidance from NVIDIA with the compatibility [matrix between the drivers and your Cuda version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). You can also see https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network for more information.
+2. It is recommended to update your drivers to the latest available, following the guidance from NVIDIA with the compatibility [matrix between the drivers and your CUDA version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). You can also see [this link](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for more information.
 
 ```bash
 sudo apt purge nvidia* libnvidia*
@@ -45,22 +51,22 @@ Let's spin up a VM with a GPU!
 sudo reboot
 ```
 
-3. Make sure you have `nvidia-container-toolkit` installed:
+3. Make sure you have `nvidia-container-toolkit` installed. If not, install it by running:
 
 ```bash
 sudo apt-get install -y nvidia-container-toolkit
 sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
 ```
 
-4. Check that your version matches with the version you need:
+4. Check that your version matches the version you need (CUDA >12.3):
 
 ```bash
 nvidia-smi
 /usr/local/cuda/bin/nvcc --version
 ```
 
-5. Prepare the model registry: It's possible to use [pre-built models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_playbook.html). However, we chose to run `Llama2-7B-chat` on a A10 GPU. At the time of writing, this choice is not available and we therefore have to [build the model repository ourselve](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html).
+5. Prepare the model registry: It's possible to use [pre-built models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_playbook.html). However, we chose to run `Llama2-7B-chat`. At the time of writing, this choice is not available and we therefore have to [build the model repository ourselves](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html).
 
 6. Start by logging into the NVIDIA container registry with your username and password and pull the container:
 
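
Note (illustrative): the hunk ends before the actual login and pull commands of step 6. A hedged sketch follows; the image name and tag come from the note earlier in the diff, but the registry path under nvcr.io and the API key variable are placeholders.

```bash
# NGC-style login: the username is the literal string $oauthtoken and the
# password is your NGC / NVIDIA AI Enterprise API key (placeholder variable).
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"

# Pull the NIM container; the <org>/<team> path is a placeholder.
docker pull nvcr.io/<org>/<team>/nemollm-inference-ms:24.02.rc4
```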
@@ -260,7 +266,7 @@ example-triton-inference-server LoadBalancer 10.18.13.28 34.83.9.133 800
 The inference server exposes an HTTP endpoint on port 8000, and GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the meta-data of the inference server from the HTTP endpoint.
 
 ```bash
-$ curl 34.83.9.133:8000/v2
+curl 34.83.9.133:8000/v2
 ```
 
 From your client machine, you can now send a request to the public IP on port 8000:
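
Note (illustrative): the request that the README actually sends from the client machine is not part of this hunk. The sketch below only shows what a request against Triton's standard HTTP/REST (KServe v2) API on port 8000 could look like; the model name and the JSON input fields are placeholders that depend on the deployed model's configuration.

```bash
# Health and statistics endpoints from Triton's standard HTTP/REST API.
curl 34.83.9.133:8000/v2/health/ready
curl 34.83.9.133:8000/v2/models/stats

# A generate-style request (model name and JSON fields are placeholders).
curl -X POST 34.83.9.133:8000/v2/models/<model_name>/generate \
  -H "Content-Type: application/json" \
  -d '{"text_input": "What is OCI Object Storage?", "max_tokens": 64}'
```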

cloud-infrastructure/ai-infra-gpu/AI Infrastructure/rag-langchain-vllm-mistral/README.md

Lines changed: 2 additions & 0 deletions
@@ -166,6 +166,8 @@ For the sake of libraries and package compatibility, is highly recommended to up
 python invoke_api.py
 ```
 
+These scripts have been benchmarked and achieved an average of 40-60 tokens/second *without FlashAttention enabled* (which means there is room for more performance) when retrieving from the RAG system and generating with Mistral (on the compute shape mentioned previously).
+
 The script will return the answer to the questions asked in the query.
 
 ## 4. Alternative deployment
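
Note (illustrative): a rough way to reproduce a tokens/second estimate like the one quoted in the added line above; the generated-token count is a placeholder that must be read from the script's own output.

```bash
# Time one end-to-end run of the RAG query script and compute a rough rate.
START=$(date +%s)
python invoke_api.py
END=$(date +%s)

# Placeholder: replace with the actual number of tokens the run generated.
TOKENS_GENERATED=512
echo "~$((TOKENS_GENERATED / (END - START))) tokens/second"
```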
