`cloud-infrastructure/ai-infra-gpu/AI Infrastructure/nim-gpu-oke/README.md`
## Introduction
This repository intends to demonstrate how to deploy [NVIDIA NIM](https://developer.nvidia.com/docs/nemo-microservices/inference/overview.html) on Oracle Container Engine for Kubernetes (OKE) with the TensorRT-LLM backend and Triton Inference Server in order to serve Large Language Models (LLMs) in a Kubernetes architecture.

The model used is `Llama2-7B-chat`, running on an NVIDIA A10 Tensor Core GPU hosted on OCI. For scalability, we are hosting the model repository in a bucket in Oracle Cloud Object Storage.
## 0. Prerequisites & Docs
* You have a [container registry](https://docs.oracle.com/en-us/iaas/Content/Registry/home.htm).
* You have an [Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI) to push/pull images to/from the registry (see the login sketch after this list).
* Your instance is able to authenticate via [instance principal](https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/callingservicesfrominstances.htm).
* You have access to **NVIDIA AI Enterprise** to pull the NIM containers.
* You are familiar with *Kubernetes* and *Helm* basic terminology.
* You have a HuggingFace account with an Access Token configured to download `Llama2-7B-chat`.
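For the registry prerequisite, a minimal OCIR login sketch is shown below; the region key, tenancy namespace, and username are placeholders to replace with your own values, and the Auth Token mentioned above is used as the password.

```bash
# Log in to Oracle Cloud Infrastructure Registry (OCIR).
# <region-key>, <tenancy-namespace> and <oci-username> are placeholders;
# when prompted for a password, paste your Auth Token (not your console password).
docker login <region-key>.ocir.io --username '<tenancy-namespace>/<oci-username>'
```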
> [!IMPORTANT]
> All tests in this walkthrough were performed with an early access version of NVIDIA NIM for LLMs, **nemollm-inference-ms:24.02.rc4**. NVIDIA NIM aims to make the deployment of LLMs easier compared to the previous implementation with Triton and TRT-LLM. Therefore, this walkthrough assumes you have previously completed [the deployment of Triton on an OKE cluster on OCI](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/triton-gpu-oke) and continues that guide, skipping the container creation.
### Docs
* [NVIDIA releases NIM for deploying AI models at scale](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)
* [Deploying Triton on OCI](https://github.com/triton-inference-server/server/tree/main/deploy/oci)
## 1. Instance Creation
Let's spin up a GPU VM on OCI!
1. Start a **VM.GPU.A10.1** instance from the `Instance > Compute` menu with the [NGC image](https://docs.oracle.com/en-us/iaas/Content/Compute/References/ngcimage.htm). A boot volume of **200-250 GB** is also recommended.
2. It is recommended to update your drivers to the latest available, following NVIDIA's guidance on the [compatibility matrix between the drivers and your CUDA version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). You can also see [this link](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network) for more information.
```bash
sudo apt purge nvidia* libnvidia*
# ... (driver reinstallation commands elided in this diff; see the sketch after this block)
sudo reboot
```
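The reinstallation commands between the purge and the reboot are not shown in this diff. A minimal sketch for the Ubuntu-based NGC image follows; the driver metapackage version is an assumption and should be chosen from the compatibility matrix linked above.

```bash
# Reinstall the NVIDIA driver stack before rebooting.
# nvidia-driver-535 is an assumed metapackage version; pick the one that
# matches your CUDA requirements.
sudo apt update
sudo apt install -y nvidia-driver-535
```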
3. Make sure you have `nvidia-container-toolkit` installed. If not, install it first (see the sketch below).
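The installation commands are not part of this diff. A minimal sketch based on NVIDIA's published container-toolkit setup for Ubuntu follows; the repository URL and package name come from NVIDIA's documentation rather than from this walkthrough.

```bash
# Add NVIDIA's container toolkit repository and install the toolkit.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure the Docker runtime and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```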
4. Check that your version matches the version you need (CUDA >12.3):
```bash
nvidia-smi
/usr/local/cuda/bin/nvcc --version
```
5. Prepare the model repository: it is possible to use [pre-built models](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_playbook.html). However, we chose to run `Llama2-7B-chat`, and at the time of writing this choice is not available, so we have to [build the model repository ourselves](https://developer.nvidia.com/docs/nemo-microservices/inference/nmi_nonprebuilt_playbook.html).
6. Start by logging into the NVIDIA container registry with your username and password and pull the container:
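The exact commands are not shown in this diff. A minimal sketch assuming the standard NGC login flow follows; the registry path of the early-access image is a placeholder to replace with the path you have been granted.

```bash
# Log in to nvcr.io; the username is the literal string $oauthtoken and the
# password is your NGC API key.
echo "${NGC_API_KEY}" | docker login nvcr.io --username '$oauthtoken' --password-stdin
# Pull the NIM container; <registry-path> is a placeholder for your
# NVIDIA AI Enterprise / early-access registry path.
docker pull nvcr.io/<registry-path>/nemollm-inference-ms:24.02.rc4
```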
The inference server exposes an HTTP endpoint on port 8000, a gRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the metadata of the inference server from the HTTP endpoint.
```bash
curl 34.83.9.133:8000/v2
```
From your client machine, you can now send a request to the public IP on port 8000:
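The request body is not shown in this diff. With Triton Inference Server as the backend, a sketch using Triton's generate extension could look like the following; the model name `ensemble`, the prompt, and the `max_tokens` value are illustrative assumptions.

```bash
# Hypothetical request: replace <public-ip>, the model name and the payload
# with the values of your own deployment.
curl -X POST http://<public-ip>:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{"text_input": "What is NVIDIA NIM?", "max_tokens": 128}'
```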
`cloud-infrastructure/ai-infra-gpu/AI Infrastructure/rag-langchain-vllm-mistral/README.md`
```bash
python invoke_api.py
```
These scripts have been benchmarked and achieved an average of 40-60 tokens/second *without FlashAttention enabled* (which means there is room for more performance) when retrieving from the RAG system and generating with Mistral (with the compute shape mentioned previously).
The script will return the answers to the questions asked in the query.