
Commit 03a96be

Update packages and commands
Some packages in the repo had to be updated to avoid deprecation and security issues. The model has also been upgraded from v0.2 to v0.3.
1 parent 8a617c1 commit 03a96be

File tree: 3 files changed (+306 −232 lines)


cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/README.md

Lines changed: 24 additions & 18 deletions
@@ -1,12 +1,12 @@
 # RAG with OCI, LangChain, and VLLMs
 
-This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.2 using a vLLM inference server powered by an NVIDIA A10 GPU.
+This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.
 
 Reviewed: 23.05.2024
 
 # When to use this asset?
 
-To run the RAG tutorial with a local deployment of Mistral 7B Instruct v0.2 using a vLLM inference server powered by an NVIDIA A10 GPU.
+To run the RAG tutorial with a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.
 
 # How to use this asset?
 
@@ -25,7 +25,7 @@ These are the components of the Python solution being used here:
 
 * **SitemapReader**: Asynchronous sitemap reader for the web (based on beautifulsoup). Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example, the sitemap.xml file is stored in an OCI bucket.
 * **QdrantClient**: Python client for the Qdrant vector search engine.
-* **SentenceTransformerEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
+* **HuggingFaceEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
 * **VLLM**: Fast and easy-to-use LLM inference server.
 * **Settings**: Bundle of commonly used resources used during the indexing and querying stage in a LlamaIndex pipeline/application. In this example, we use global configuration.
 * **QdrantVectorStore**: Vector store where embeddings and docs are stored within a Qdrant collection.
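
For reference, a minimal sketch of how the swapped-in `HuggingFaceEmbeddings` class is used, assuming the `langchain-huggingface` package is installed and the same `all-MiniLM-L6-v2` model as in the script:

```python
# Minimal sketch: load the sentence-embedding model through the
# langchain_huggingface integration that replaces SentenceTransformerEmbeddings.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed a single query string into a dense vector (384 dimensions for this model).
vector = embeddings.embed_query("What are the document formats supported by the Vision service?")
print(len(vector))
```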
@@ -82,23 +82,20 @@ For the sake of libraries and package compatibility, is highly recommended to up
 sudo apt-get update && sudo apt-get upgrade -y
 ```
 
-2. (*) Remove the current NVIDIA packages and replace them with the following versions.
+2. (*) Install the latest NVIDIA drivers.
 
 ```bash
-sudo apt purge nvidia* libnvidia* -y
-sudo apt-get install -y cuda-drivers-545
-sudo apt-get install -y nvidia-kernel-open-545
-sudo apt-get install -y cuda-toolkit-12-3
+sudo apt install ubuntu-drivers-common
+sudo ubuntu-drivers install --gpgpu nvidia:570-server
+sudo apt install nvidia-utils-570-server
+sudo reboot
 ```
 
-3. (*) We make sure that `nvidia-smi` is installed in our GPU instance. If it isn't, let's install it:
+3. (*) We make sure that `nvidia-smi` is installed in our GPU instance:
 
 ```bash
 # run nvidia-smi
 nvidia-smi
-# if not found, install it.
-sudo apt install nvidia-utils-510 -y
-sudo apt install nvidia-driver-535 nvidia-dkms-535 -y
 ```
 
 4. (*) After installation, we need to add the CUDA path to the PATH environment variable, so that NVCC (the NVIDIA CUDA Compiler) can find the right CUDA executable for parallelizing and running code:
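
After the driver installation and reboot, a quick sanity check from Python can confirm the GPU is visible before continuing. This is an optional sketch and assumes PyTorch is already installed in the environment:

```python
# Optional sanity check (assumes PyTorch is installed): confirm the A10 GPU
# and CUDA runtime are visible before building the RAG pipeline.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)
else:
    print("No CUDA device visible; re-check the NVIDIA driver installation.")
```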
@@ -146,10 +143,16 @@ For the sake of libraries and package compatibility, is highly recommended to up
 conda activate rag
 pip install packaging
 pip install -r requirements.txt
-# requirements.txt can be found in `technology-engineering/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/`
+# requirements.txt can be found in `technology-engineering/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/files`
+```
+
+9. Install `gcc` compiler to be able to build PyTorch (in vllm):
+
+```bash
+sudo apt install -y gcc
 ```
 
-9. Finally, reboot the instance and reconnect via SSH.
+10. Finally, reboot the instance and reconnect via SSH.
 
 ```bash
 ssh -i <private.key> ubuntu@<public-ip>
@@ -158,10 +161,13 @@ For the sake of libraries and package compatibility, is highly recommended to up
 
 ## Running the solution
 
-1. You can run an editable file with parameters to test one query by running:
+1. You can run an editable file with parameters to test one query, but first set the `VLLM_WORKER_MULTIPROC_METHOD` environment variable and start an `ipython` interactive terminal:
 
 ```bash
-python rag-langchain-vllm-mistral.py
+export VLLM_WORKER_MULTIPROC_METHOD="spawn"
+conda install ipython
+ipython
+run rag-langchain-vllm-mistral.py
 ```
 
 2. If you want to run a batch of queries against Mistral with the vLLM engine, execute the following script (containing an editable list of queries):
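
The batch script itself is outside this hunk, but a hedged sketch of such a loop, reusing the `query_engine` built as in `rag-langchain-vllm-mistral.py`, might look like this (the query list is a placeholder):

```python
# Hypothetical batch-query loop; `query_engine` is assumed to have been built
# exactly as in rag-langchain-vllm-mistral.py (index.as_query_engine(llm=llm)).
queries = [
    "What are the document formats supported by the Vision service?",
    "Which languages does the Vision service support?",  # placeholder query
]

for q in queries:
    response = query_engine.query(q)
    print("Q:", q)
    print("A:", response.response.strip())
```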
@@ -210,7 +216,7 @@ Instead of:
 from langchain_community.llms import VLLM
 
 llm = VLLM(
-model="mistralai/Mistral-7B-v0.1",
+model="mistralai/Mistral-7B-v0.3",
 ...
 vllm_kwargs={
 ...
@@ -226,7 +232,7 @@ from langchain_community.llms import VLLMOpenAI
 llm = VLLMOpenAI(
 openai_api_key="EMPTY",
 openai_api_base="http://localhost:8000/v1",
-model_name="mistralai/Mistral-7B-v0.1",
+model_name="mistralai/Mistral-7B-v0.3",
 model_kwargs={
 ...
 },
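
Spelled out, a minimal sketch of this client-server variant, assuming an OpenAI-compatible vLLM server is already running on localhost:8000 and serving the Mistral 7B Instruct v0.3 weights (the prompt is a placeholder):

```python
# Minimal sketch of the VLLMOpenAI (client-server) variant described above.
# Assumes an OpenAI-compatible vLLM server is already listening on localhost:8000.
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",                           # no key needed for a local server
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.3",  # must match the model the server was started with
    max_tokens=128,
)

print(llm.invoke("What are the document formats supported by the Vision service?"))
```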

cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/files/rag-langchain-vllm-mistral.py

Lines changed: 68 additions & 66 deletions
@@ -2,84 +2,86 @@
 from llama_index.vector_stores.qdrant import QdrantVectorStore
 from llama_index.readers.web import SitemapReader
 from qdrant_client import QdrantClient
-from langchain_community.embeddings import SentenceTransformerEmbeddings
+from langchain_huggingface import HuggingFaceEmbeddings
 from langchain_community.llms import VLLM, VLLMOpenAI
 
 
-loader = SitemapReader(html_to_text=True)
-# Reads pages from the web based on their sitemap.xml.
-# Other data connectors available.
+if __name__ == '__main__':
 
-documents = loader.load_data(
-    sitemap_url='https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frpj5kvxryk1/b/thisIsThePlace/o/latest.xml'
-)
-# for document in documents:
-# print(document.metadata['Source'])
+    loader = SitemapReader(html_to_text=True)
+    # Reads pages from the web based on their sitemap.xml.
+    # Other data connectors available.
 
-# local Docker-based instance of Qdrant
-client = QdrantClient(
-    location=":memory:"
-)
+    documents = loader.load_data(
+        sitemap_url='https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frpj5kvxryk1/b/thisIsThePlace/o/latest.xml'
+    )
 
-embeddings = SentenceTransformerEmbeddings(
-    model_name="all-MiniLM-L6-v2"
-)
+    # local Docker-based instance of Qdrant
+    client = QdrantClient(
+        location=":memory:"
+    )
 
-# local instance of Mistral 7B v0.1 using vLLM inference server
-# and FlashAttention backend for performance. Model is downloaded
-# from HuggingFace (no accoutn needed).
-llm = VLLM(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
-    gpu_memory_utilization=0.95,
-    tensor_parallel_size=1, # inference distributed over X GPUs
-    trust_remote_code=True, # mandatory for hf model
-    max_new_tokens=128,
-    top_k=10,
-    top_p=0.95,
-    temperature=0.8,
-    vllm_kwargs={
-        "swap_space": 1,
-        "gpu_memory_utilization": 0.95,
-        "max_model_len": 16384, # limitation due to unsufficient RAM
-        "enforce_eager": True,
-    },
-)
+    #embeddings = SentenceTransformerEmbeddings(
+    embeddings = HuggingFaceEmbeddings(
+        model_name="all-MiniLM-L6-v2"
+    )
 
-system_prompt="As a support engineer, your role is to leverage the information \
-in the context provided. Your task is to respond to queries based strictly \
-on the information available in the provided context. Do not create new \
-information under any circumstances. Refrain from repeating yourself. \
-Extract your response solely from the context mentioned above. \
-If the context does not contain relevant information for the question, \
-respond with 'How can I assist you with questions related to the document?"
+    # local instance of Mistral 7B Instruct v0.3 using vLLM inference server
+    # and FlashAttention backend for performance. Model is downloaded
+    # from HuggingFace (no account needed).
+    llm = VLLM(
+        model="mistralai/Mistral-7B-Instruct-v0.3",
+        gpu_memory_utilization=0.95,
+        tensor_parallel_size=1, # inference distributed over X GPUs
+        trust_remote_code=True, # mandatory for hf model
+        max_new_tokens=128,
+        top_k=10,
+        top_p=0.95,
+        temperature=0.8,
+        vllm_kwargs={
+            "tokenizer_mode": "mistral",
+            "swap_space": 1,
+            "gpu_memory_utilization": 0.95,
+            "max_model_len": 16384, # limitation due to insufficient RAM
+            "enforce_eager": False,
+        },
+    )
 
-Settings.llm = llm
-Settings.embed_model = embeddings
-Settings.chunk_size=1000
-Settings.chunk_overlap=100
-Settings.num_output = 256
-Settings.system_prompt=system_prompt
+    system_prompt="As a support engineer, your role is to leverage the information \
+in the context provided. Your task is to respond to queries based strictly \
+on the information available in the provided context. Do not create new \
+information under any circumstances. Refrain from repeating yourself. \
+Extract your response solely from the context mentioned above. \
+If the context does not contain relevant information for the question, \
+respond with 'How can I assist you with questions related to the document?"
 
-vector_store = QdrantVectorStore(
-    client=client,
-    collection_name="ansh"
-)
+    Settings.llm = llm
+    Settings.embed_model = embeddings
+    Settings.chunk_size=1000
+    Settings.chunk_overlap=100
+    Settings.num_output = 256
+    Settings.system_prompt=system_prompt
 
-storage_context = StorageContext.from_defaults(
-    vector_store=vector_store
-)
+    vector_store = QdrantVectorStore(
+        client=client,
+        collection_name="ansh"
+    )
 
-index = VectorStoreIndex.from_documents(
-    documents,
-    storage_context=storage_context
-)
+    storage_context = StorageContext.from_defaults(
+        vector_store=vector_store
+    )
 
-query_engine = index.as_query_engine(llm=llm)
+    index = VectorStoreIndex.from_documents(
+        documents,
+        storage_context=storage_context
+    )
 
-response = query_engine.query(
-    'What are the document formats supported by the Vision service?'
-)
+    query_engine = index.as_query_engine(llm=llm)
 
-print("Response: ", response.response.strip())
-for key in response.metadata.keys():
-    print("Source: ", response.metadata[key]['Source'])
+    response = query_engine.query(
+        'What are the document formats supported by the Vision service?'
+    )
+
+    print("Response: ", response.response.strip())
+    for key in response.metadata.keys():
+        print("Source: ", response.metadata[key]['Source'])
