This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial.
# Requirements
* An OCI tenancy with A10 GPU quota.
* A HuggingFace account with a valid Access Token.
# Libraries
* **LlamaIndex**: a data framework for LLM-based applications that benefit from context augmentation.
* **LangChain**: a framework for developing applications powered by large language models.
* **vLLM**: a fast and easy-to-use library for LLM inference and serving.
* **Qdrant**: a vector similarity search engine.
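At its core, the similarity search that Qdrant provides can be pictured with a tiny pure-Python sketch. This is illustrative only: Qdrant itself uses optimized approximate nearest-neighbour indexes, and the vectors and names below are made up for the example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors):
    # Return the id of the stored vector most similar to the query.
    return max(vectors, key=lambda vid: cosine_similarity(query, vectors[vid]))

# Toy "collection" of 3-dimensional embeddings.
collection = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.5, 0.5, 0.5],
}

print(nearest([1.0, 0.0, 0.1], collection))  # doc_a points in the closest direction
```

In the RAG pipeline below, the query embedding plays the role of `query` and the indexed document chunks play the role of `collection`, just at much higher dimensionality.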
# Mistral LLM
[Mistral.ai](https://mistral.ai/) is a French AI startup that develops Large Language Models (LLMs). Mistral 7B Instruct is a small yet powerful open model that supports English and code. The Instruct version is optimized for chat. In this example, inference performance is increased using the [FlashAttention](https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention) backend.
# Instance Creation
In this example a single A10 GPU VM shape, codename VM.GPU.A10.1, is used. This is currently the smallest GPU shape available on OCI. With this configuration, the vLLM model context length option must be limited to 16384 because GPU memory is insufficient for the full context. To use the full context length, a dual A10 GPU shape, codename VM.GPU.A10.2, is necessary.\
The image is the NVIDIA GPU Cloud Machine image from the OCI marketplace.\
A boot volume of 200 GB is also recommended.\
Create the instance and connect to it via ssh once it is running, where `public.key` is the ssh public key provided in the instance creation phase and `public.ip` is the instance Public IP address that can be found in the OCI Console.
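The 16384-token limit mentioned above can be sanity-checked with a back-of-the-envelope KV-cache estimate. The figures below are assumptions taken from Mistral 7B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128, ~7.2 B parameters) at fp16 precision; real vLLM usage adds activation and allocator overhead on top.

```python
# Back-of-the-envelope KV-cache sizing for Mistral 7B at fp16.
# Architecture figures are assumed from the published model config.
layers = 32        # transformer blocks
kv_heads = 8       # grouped-query attention KV heads
head_dim = 128     # dimension per head
bytes_fp16 = 2

# K and V are each cached per layer, per KV head, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 131072 B

def kv_cache_gib(context_len):
    return kv_bytes_per_token * context_len / 1024**3

weights_gib = 7.2e9 * bytes_fp16 / 1024**3  # roughly 13.4 GiB of fp16 weights

print(f"KV cache @16384 tokens: {kv_cache_gib(16384):.1f} GiB")  # ~2 GiB
print(f"KV cache @32768 tokens: {kv_cache_gib(32768):.1f} GiB")  # ~4 GiB
# A single A10 has 24 GiB of VRAM; weights + full-context cache + overhead
# is tight, which is why the context length is capped on VM.GPU.A10.1.
```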
# Walkthrough
This walkthrough guides you through the different steps of the deployment, from configuring the environment to running the different components of the RAG solution.
## Update packages and drivers
For the sake of library and package compatibility, it is highly recommended to update the image packages, NVIDIA drivers, and CUDA version. First, fetch, download, and install the distribution packages (optional):
```
sudo apt-get update && sudo apt-get upgrade -y
```
Then remove the current NVIDIA packages and replace them with the following versions:
```
sudo apt purge nvidia* libnvidia* -y
sudo apt-get install -y cuda-drivers-545
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-toolkit-12-3
```
Finally, reboot the instance:
```
sudo reboot
```
## Configure environment
Once the instance is running again, clone the repository and go to the right folder:
```
git clone https://github.com/oracle-devrel/technology-engineering.git
cd technology-engineering/cloud-infrastructure/ai-infra-gpu/AI\ Infrastructure/rag-langchain-vllm-mistral/
```
Then update conda and create a virtual environment with all the required packages:
```
conda update -n base -c conda-forge conda
conda env create -f environment.yml
```
Then activate the environment:
```
conda activate rag
```
## Deploy the framework
### Framework components
* **SitemapReader**: Asynchronous sitemap reader for the web (based on BeautifulSoup). Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example the sitemap.xml file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **SentenceTransformerEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: Fast and easy-to-use LLM inference server.
* **StorageContext**: Utility container for storing nodes, indices, and vectors.
* **VectorStoreIndex**: Index built from the documents loaded in the Vector Store.
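Conceptually, these components chain into a load → embed → index → query pipeline. The sketch below is purely schematic: every name in it is an illustrative stand-in, not the actual LlamaIndex/LangChain API the script uses.

```python
# Schematic RAG flow; all names here are illustrative stand-ins.
def load_documents():
    # Stands in for SitemapReader: returns page texts to index.
    return ["Qdrant stores vectors", "vLLM serves Mistral", "LlamaIndex loads data"]

def embed(text):
    # Stands in for SentenceTransformerEmbeddings: maps text to a vector.
    return [float(len(text)), float(text.count("e")), float(text.count("s"))]

class ToyVectorStore:
    # Stands in for QdrantClient + StorageContext + VectorStoreIndex.
    def __init__(self):
        self.rows = []  # list of (vector, document) pairs

    def add(self, doc):
        self.rows.append((embed(doc), doc))

    def top_match(self, query):
        qv = embed(query)
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, qv))
        return min(self.rows, key=lambda row: dist(row[0]))[1]

store = ToyVectorStore()
for doc in load_documents():
    store.add(doc)  # index build step

# Retrieval step: the closest stored chunk becomes the LLM's context.
context = store.top_match("vLLM serves Mistral")
prompt = f"Answer using this context: {context}"  # handed to vLLM for generation
```

The real script swaps each stand-in for the corresponding component above, with high-dimensional sentence embeddings instead of this toy length-based encoding.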
### Running the solution
The Python script creates an all-in-one framework with local instances of the Qdrant vector similarity search engine and the vLLM inference server. First, set your HuggingFace Access Token as an environment variable:
```
export HF_TOKEN=your-hf-token
```
where `your-hf-token` is your personal Access Token. It might also be necessary to validate the Mistral model access on the HuggingFace website. Then run the Python script:
```
python rag-langchain-vllm-mistral.py
```
The script will return the answer to the question asked in the query.
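If `HF_TOKEN` was never exported, the model download fails with an opaque authentication error. A fail-fast guard like the following makes the problem obvious; this is an illustration, not necessarily code present in the repository's script.

```python
import os

# Read the token exported earlier; empty string if it was never set.
hf_token = os.environ.get("HF_TOKEN", "")
token_is_set = bool(hf_token.strip())

if not token_is_set:
    print("HF_TOKEN is not set; run `export HF_TOKEN=your-hf-token` first.")
```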
## Alternative deployment
Alternatively, it is possible to deploy these two components remotely using Docker containers. This option can be useful in two situations:
* The engines are shared by multiple solutions for which data must be segregated.
* The engines are deployed on instances with optimized configurations (GPU, RAM, CPU cores, etc.).
### Remote Qdrant client
Similarly, the vLLM engine can be addressed remotely through its OpenAI-compatible interface instead of a local in-process instance:
```
llm = VLLMOpenAI(
    ...
    },
)
```
To deploy the container, refer to this [tutorial](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/AI%20Infrastructure/vllm-mistral).