# RAG with OCI, LangChain, and VLLMs

This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.

Reviewed: 23.05.2024

# When to use this asset?

To run the RAG tutorial with a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.

# How to use this asset?

These are the components of the Python solution being used here:

* **SitemapReader**: Asynchronous sitemap reader for the web (based on BeautifulSoup). Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example, the sitemap.xml file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **HuggingFaceEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: Fast and easy-to-use LLM inference server.
* **Settings**: Bundle of commonly used resources for the indexing and querying stages of a LlamaIndex pipeline/application. In this example, we use the global configuration.
* **QdrantVectorStore**: Vector store where embeddings and docs are stored within a Qdrant collection.
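
For orientation, the components listed above map to the following imports, as used in the example script at the end of this asset (llama-index >= 0.10 package layout assumed):

```python
# Imports corresponding to the components described above.
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.readers.web import SitemapReader
from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import VLLM
```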
1. For the sake of library and package compatibility, it is highly recommended to update and upgrade all packages first:

```bash
sudo apt-get update && sudo apt-get upgrade -y
```

2. (*) Install the latest NVIDIA drivers.

```bash
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers install --gpgpu nvidia:570-server
sudo apt install nvidia-utils-570-server
sudo reboot
```

3. (*) Make sure that `nvidia-smi` is installed on the GPU instance:

```bash
# run nvidia-smi
nvidia-smi
```

4. (*) After installation, we need to add the CUDA path to the PATH environment variable so that NVCC (the NVIDIA CUDA Compiler) can find the right CUDA executables for parallelizing and running code:
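
The exact paths depend on the CUDA toolkit version installed on the instance; a typical approach (a sketch, assuming the toolkit lives under `/usr/local/cuda`) is to append the following to `~/.bashrc`:

```bash
# Assumes the CUDA toolkit is installed under /usr/local/cuda
# (adjust to your version, e.g. /usr/local/cuda-12.3).
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Reload the shell configuration with `source ~/.bashrc` and verify the setup with `nvcc --version`.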
8. Activate the `rag` conda environment and install the Python requirements:

```bash
conda activate rag
pip install packaging
pip install -r requirements.txt
# requirements.txt can be found in `technology-engineering/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/files`
```

9. Install the `gcc` compiler, which is required to build the PyTorch components used by vLLM:

```bash
sudo apt install -y gcc
```

10. Finally, reboot the instance and reconnect via SSH.

```bash
ssh -i <private.key> ubuntu@<public-ip>
```

## Running the solution

1. You can run an editable file with parameters to test a single query. First set the `VLLM_WORKER_MULTIPROC_METHOD` environment variable and install the `ipython` interactive shell, then run the script:

```bash
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
conda install ipython
ipython
run rag-langchain-vllm-mistral.py
```

2. If you want to run a batch of queries against Mistral with the vLLM engine, execute the following script (containing an editable list of queries):
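
As a rough orientation, a self-contained sketch of such a batch loop is shown below (the query list is hypothetical, and the repository's actual script may instead route each query through the RAG query engine shown in the full example script at the end of this asset):

```python
# Hypothetical sketch: send an editable list of queries to the locally
# deployed Mistral 7B Instruct v0.3 model through the LangChain VLLM wrapper.
from langchain_community.llms import VLLM

queries = [
    "What are the document formats supported by the Vision service?",
    "Which languages does the Language service support?",  # hypothetical query
]

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    trust_remote_code=True,
    max_new_tokens=128,
    temperature=0.8,
)

for query in queries:
    print("Q:", query)
    print("A:", llm.invoke(query))
```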
If the model is served by a standalone vLLM server exposing the OpenAI-compatible API, the LLM object must be created with `VLLMOpenAI` instead of `VLLM`. Instead of:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-v0.3",
    ...
    vllm_kwargs={
        ...
    },
)
```
use:

```python
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-v0.3",
    model_kwargs={
        ...
    },
)
```
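
The `VLLMOpenAI` client above assumes a vLLM server is already exposing the OpenAI-compatible API on `http://localhost:8000/v1`. One common way to start such a server (a sketch; the exact command and flags depend on the installed vLLM version, and newer releases also accept `vllm serve <model>`) is:

```bash
# Start a standalone vLLM server exposing the OpenAI-compatible API on port 8000.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-v0.3 \
    --max-model-len 16384
```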

from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.readers.web import SitemapReader
from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import VLLM, VLLMOpenAI


if __name__ == '__main__':

    loader = SitemapReader(html_to_text=True)
    # Reads pages from the web based on their sitemap.xml.
    # Other data connectors are available.

    documents = loader.load_data(
        sitemap_url='https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frpj5kvxryk1/b/thisIsThePlace/o/latest.xml'
    )

    # In-memory instance of Qdrant (a local Docker-based deployment works as well)
    client = QdrantClient(
        location=":memory:"
    )

    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2"
    )

    # Local instance of Mistral 7B Instruct v0.3 using the vLLM inference server
    # and the FlashAttention backend for performance. The model is downloaded
    # from HuggingFace (no account needed).
    llm = VLLM(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        gpu_memory_utilization=0.95,
        tensor_parallel_size=1,  # inference distributed over X GPUs
        trust_remote_code=True,  # mandatory for HF models
        max_new_tokens=128,
        top_k=10,
        top_p=0.95,
        temperature=0.8,
        vllm_kwargs={
            "tokenizer_mode": "mistral",
            "swap_space": 1,
            "gpu_memory_utilization": 0.95,
            "max_model_len": 16384,  # limitation due to insufficient RAM
            "enforce_eager": False,
        },
    )

    system_prompt = (
        "As a support engineer, your role is to leverage the information "
        "in the context provided. Your task is to respond to queries based strictly "
        "on the information available in the provided context. Do not create new "
        "information under any circumstances. Refrain from repeating yourself. "
        "Extract your response solely from the context mentioned above. "
        "If the context does not contain relevant information for the question, "
        "respond with 'How can I assist you with questions related to the document?'"
    )

    Settings.llm = llm
    Settings.embed_model = embeddings
    Settings.chunk_size = 1000
    Settings.chunk_overlap = 100
    Settings.num_output = 256
    Settings.system_prompt = system_prompt

    vector_store = QdrantVectorStore(
        client=client,
        collection_name="ansh"
    )

    storage_context = StorageContext.from_defaults(
        vector_store=vector_store
    )

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context
    )

    query_engine = index.as_query_engine(llm=llm)

    response = query_engine.query(
        'What are the document formats supported by the Vision service?'
    )

    print("Response: ", response.response.strip())
    for key in response.metadata.keys():
        print("Source: ", response.metadata[key]['Source'])