This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai/files). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.2 on a vLLM inference server powered by an NVIDIA A10 GPU.
# Libraries
The following libraries and modules are used in this solution:

* **LlamaIndex**: a data framework for LLM-based applications which benefit from context augmentation.
* **LangChain**: a framework for developing applications powered by large language models.
* **vLLM**: a fast and easy-to-use library for LLM inference and serving.
* **Qdrant**: a vector similarity search engine.

# Mistral LLM
As we're using a Mistral model, [Mistral.ai](https://mistral.ai/) also deserves a proper introduction: Mistral AI is a French AI startup that develops Large Language Models (LLMs) and is one of the few companies offering uncensored versions of their models (interesting to look into as a developer). Mistral 7B Instruct is a small yet powerful open model that supports English and code. The Instruct version, the one we're using here, is optimized for chat.
In this example, inference performance is increased using the [FlashAttention](https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention) backend.
These are the components of the Python solution being used here:
* **SitemapReader**: asynchronous sitemap reader for the web (based on BeautifulSoup). Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example the sitemap.xml file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **SentenceTransformerEmbeddings**: sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: fast and easy-to-use LLM inference server.
* **Settings**: bundle of commonly used resources used during the indexing and querying stages of a LlamaIndex pipeline/application. In this example we use a global configuration.
* **QdrantVectorStore**: vector store where embeddings and documents are stored within a Qdrant collection.
* **StorageContext**: utility container for storing nodes, indices, and vectors.
* **VectorStoreIndex**: index built from the documents loaded in the vector store.
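To make the roles above concrete, here is a minimal sketch of how these components can be wired together. It is an illustration, not the repository's script: import paths follow llama-index 0.10.x and may differ in your installed version, and the sitemap URL, collection name, and embedding model name are placeholder assumptions.

```python
def build_query_engine(sitemap_url: str):
    # Imports are kept local so the sketch reads standalone; paths
    # follow llama-index 0.10.x and may differ in other versions.
    from qdrant_client import QdrantClient
    from llama_index.core import Settings, StorageContext, VectorStoreIndex
    from llama_index.readers.web import SitemapReader
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from llama_index.llms.vllm import Vllm
    from llama_index.embeddings.langchain import LangchainEmbedding
    from langchain_community.embeddings import SentenceTransformerEmbeddings

    # Global configuration: local vLLM inference + a HuggingFace embedder.
    Settings.llm = Vllm(model="mistralai/Mistral-7B-Instruct-v0.2")
    Settings.embed_model = LangchainEmbedding(
        SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    )

    # Read the pages listed in sitemap.xml, embed them, and store the
    # vectors in a Qdrant collection.
    documents = SitemapReader(html_to_text=True).load_data(sitemap_url=sitemap_url)
    vector_store = QdrantVectorStore(
        client=QdrantClient(location=":memory:"), collection_name="docs"
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    return index.as_query_engine()
```

A query is then simply `build_query_engine(url).query("...")`.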
## 0. Prerequisites & Docs
### Prerequisites
* An OCI tenancy with available credits to spend, and access to A10 GPU(s).
* A registered and verified HuggingFace account with a valid Access Token.
### Docs
There are two approaches here: either install everything from scratch using an Ubuntu 22 LTS OS image, or use a marketplace image from NVIDIA, which will significantly reduce the overhead of installing all NVIDIA drivers and dependencies manually. However, these steps are also provided for those of you who want to know exactly what goes into your machine.

## 1. Instance Creation
A boot volume of 200-250 GB is also recommended.
In this example a single A10 GPU VM shape, codename `VM.GPU.A10.1`, is used. This is currently the smallest GPU shape available on OCI. With this configuration, it is recommended to limit the vLLM model context length to **16384** tokens, especially for larger models, because GPU memory is otherwise insufficient. To use the full context length, a dual A10 GPU shape, codename `VM.GPU.A10.2`, will be necessary.
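The context-length limit is easy to sanity-check with a back-of-the-envelope KV-cache estimate. Using Mistral 7B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) at fp16, the cache alone at a 16384-token context costs about 2 GiB, on top of the roughly 14 GB of fp16 model weights sharing the A10's 24 GB:

```python
# Rough KV-cache size for Mistral 7B at fp16 (2 bytes per value).
layers, kv_heads, head_dim = 32, 8, 128   # Mistral 7B architecture
bytes_per_value = 2                        # fp16
context_len = 16384

# K and V are each cached per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_bytes = bytes_per_token * context_len
print(f"{bytes_per_token} bytes/token, {total_bytes / 2**30:.1f} GiB total")
# → 131072 bytes/token, 2.0 GiB total
```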
> **Important**: If you've chosen to follow the guide with the NVIDIA GPU Cloud Machine image (instead of a fresh Ubuntu image), you won't need to execute some of the steps found below in chapter 2: Setup. These steps are marked with an asterisk (*) at the beginning so you know which ones to **skip** and which ones to execute.
1. Create a GPU instance on OCI if you haven't already.
2. Connect to the instance via SSH once the instance has been created:
```bash
ssh -i <private.key> ubuntu@<public-ip>
```
where `private.key` is the ssh private key provided in the instance creation phase and `public-ip` is the instance Public IP address that can be found in the OCI Console.
Once we have SSH access to our instance, we can proceed with the setup.
## 2. Setup
For the sake of library and package compatibility, it is highly recommended to update the image packages, NVIDIA drivers, and CUDA versions.
1. Fetch, download and install the packages of the distribution:
```bash
sudo apt-get update && sudo apt-get upgrade -y
```
2. (*) Remove the current NVIDIA packages and replace them with the following versions.
```bash
sudo apt purge nvidia* libnvidia* -y
sudo apt-get install -y cuda-drivers-545
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-toolkit-12-3
```
3. (*) We make sure that `nvidia-smi` is installed in our GPU instance, and install it if it isn't.
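A sketch of that check (the package name is an assumption: on Ubuntu, the versioned `nvidia-utils-545` package matches the 545 driver series installed in the previous step):

```shell
# print GPU status; install the utilities package if the command is missing
nvidia-smi || sudo apt-get install -y nvidia-utils-545
```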
4. (*) After installation, we need to add the CUDA path to the PATH environment variable, to allow for NVCC (NVIDIA CUDA Compiler) is able to find the right CUDA executable for parallelizing and running code:
```bash
# first, we find the CUDA /bin folder:
find / -type d -name cuda 2>/dev/null
# e.g. /usr/local/cuda-12.4/bin in Ubuntu 22
# then, we append it to the end of /home/$USER/.bashrc for consistency:
echo "" >> /home/ubuntu/.bashrc # we also include a new line at the end.
echo 'export PATH="$PATH:/usr/local/cuda-12.4/bin"' >> /home/ubuntu/.bashrc
source /home/ubuntu/.bashrc
```

5. Set your HuggingFace Access Token as an environment variable (it is needed to download the model):

```bash
export HF_TOKEN=your-hf-token
```

where `your-hf-token` is your personal Access Token.
>**Note**: if you're having issues during execution with the default Mistral 7B model, it might also be necessary to validate the Mistral model access on the HuggingFace website. If so, go [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and request permission to use the model.
6. Then, we prepare our Python environment. If it's a new machine where you don't have Conda installed, let's install Miniconda:
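The install itself can look like the following (the URL is Anaconda's generic latest-Linux-x86_64 installer — verify it against the Miniconda documentation if it has moved):

```shell
# download and run the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```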
>**Note**: remember to accept all defaults in the Miniconda installation script. After the script has been executed, restart your shell session and you'll already have conda installed in the OS.
7. Once the instance is running again, clone the repository and go to the right folder:

```bash
git clone https://github.com/oracle-devrel/technology-engineering.git
cd technology-engineering/cloud-infrastructure/ai-infra-gpu/AI\ Infrastructure/rag-langchain-vllm-mistral/
git checkout rag-marketing-update # switch to this branch just in case new changes aren't synced with the main branch yet
```
8. Now, we install the necessary dependencies (in [requirements.txt](requirements.txt)) to run the environment:
```bash
conda create -n rag python=3.10
conda activate rag
pip install packaging
pip install -r requirements.txt
# requirements.txt can be found in `technology-engineering/cloud-infrastructure/ai-infra-gpu/AI Infrastructure/rag-langchain-vllm-mistral/`
```
9. Finally, reboot the instance and reconnect via SSH.
```bash
sudo reboot
# once the instance is back up, reconnect:
ssh -i <private.key> ubuntu@<public-ip>
```
## 3. Running the solution
1. To test a single query, run the following script (its parameters are editable):
```bash
python rag-langchain-vllm-mistral.py
```
2. If you want to run a batch of queries against Mistral with the vLLM engine, execute the following script (it contains an editable list of queries):
```bash
python invoke_api.py
```
The script will return the answers to the questions asked in the queries.
## 4. Alternative deployment
Alternatively, it is possible to deploy both components (the Qdrant engine and the vLLM server) remotely using Docker containers. This option can be useful in two situations:
* The engines are shared by multiple solutions for which data must be segregated.
* The engines are deployed on instances with optimized configurations (GPU, RAM, CPU cores, etc.).
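On the client side, switching to remotely deployed engines is a small change. A sketch under the usual default ports (Qdrant's HTTP API on 6333, vLLM's OpenAI-compatible server on 8000); the hostnames are placeholders and the `OpenAILike` wrapper follows llama-index 0.10.x naming:

```python
def connect_remote(qdrant_host: str, vllm_host: str):
    # Local imports keep the sketch readable without the dependencies.
    from qdrant_client import QdrantClient
    from llama_index.llms.openai_like import OpenAILike

    # Remote Qdrant engine instead of a local/in-memory instance.
    client = QdrantClient(url=f"http://{qdrant_host}:6333")

    # vLLM served separately, reached through its OpenAI-compatible API.
    llm = OpenAILike(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        api_base=f"http://{vllm_host}:8000/v1",
        api_key="not-needed",  # vLLM does not check the key by default
    )
    return client, llm
```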
To deploy the container, refer to this [tutorial](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/AI%20Infrastructure/vllm-mistral).
The libraries used in this example are evolving quite fast. The Python script provided here might have to be updated in the near future to avoid warnings and errors.