# RAG with OCI, LangChain, and VLLMs

This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.

Reviewed: 23.05.2024

# When to use this asset?

To run the RAG tutorial with a local deployment of Mistral 7B Instruct v0.3 using a vLLM inference server powered by an NVIDIA A10 GPU.

# How to use this asset?

These are the components of the Python solution being used here:

* **SitemapReader**: Asynchronous sitemap reader for the web (based on BeautifulSoup). Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example, the sitemap.xml file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **HuggingFaceEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: Fast and easy-to-use LLM inference server.
* **Settings**: Bundle of commonly used resources for the indexing and querying stages of a LlamaIndex pipeline/application. In this example, we use the global configuration.
* **QdrantVectorStore**: Vector store where embeddings and docs are stored within a Qdrant collection.
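
For orientation, the components listed above map to the following imports, as used in the example script at the end of this asset (llama-index >= 0.10 package layout assumed):

```python
# Imports corresponding to the components described above.
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.readers.web import SitemapReader
from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import VLLM
```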
1. For the sake of library and package compatibility, it is highly recommended to update and upgrade all packages first:

```bash
sudo apt-get update && sudo apt-get upgrade -y
```

2. (*) Install the latest NVIDIA drivers.

```bash
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers install --gpgpu nvidia:570-server
sudo apt install nvidia-utils-570-server
sudo reboot
```

3. (*) Make sure that `nvidia-smi` is installed on the GPU instance:

```bash
# run nvidia-smi
nvidia-smi
```

4. (*) After installation, we need to add the CUDA path to the PATH environment variable so that NVCC (the NVIDIA CUDA Compiler) can find the right CUDA executables for parallelizing and running code:
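
The exact paths depend on the CUDA toolkit version installed on the instance; a typical approach (a sketch, assuming the toolkit lives under `/usr/local/cuda`) is to append the following to `~/.bashrc`:

```bash
# Assumes the CUDA toolkit is installed under /usr/local/cuda
# (adjust to your version, e.g. /usr/local/cuda-12.3).
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Reload the shell configuration with `source ~/.bashrc` and verify the setup with `nvcc --version`.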
8. Activate the `rag` conda environment and install the Python requirements:

```bash
conda activate rag
pip install packaging
pip install -r requirements.txt
# requirements.txt can be found in `technology-engineering/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/rag-langchain-vllm-mistral/files`
```

9. Install the `gcc` compiler, which is required to build the PyTorch components used by vLLM:

```bash
sudo apt install -y gcc
```

10. Finally, reboot the instance and reconnect via SSH.

```bash
ssh -i <private.key> ubuntu@<public-ip>
```

## Running the solution

1. You can run an editable file with parameters to test a single query. First set the `VLLM_WORKER_MULTIPROC_METHOD` environment variable and install the `ipython` interactive shell, then run the script:

```bash
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
conda install ipython
ipython
run rag-langchain-vllm-mistral.py
```

2. If you want to run a batch of queries against Mistral with the vLLM engine, execute the following script (containing an editable list of queries):
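
As a rough orientation, a self-contained sketch of such a batch loop is shown below (the query list is hypothetical, and the repository's actual script may instead route each query through the RAG query engine shown in the full example script at the end of this asset):

```python
# Hypothetical sketch: send an editable list of queries to the locally
# deployed Mistral 7B Instruct v0.3 model through the LangChain VLLM wrapper.
from langchain_community.llms import VLLM

queries = [
    "What are the document formats supported by the Vision service?",
    "Which languages does the Language service support?",  # hypothetical query
]

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    trust_remote_code=True,
    max_new_tokens=128,
    temperature=0.8,
)

for query in queries:
    print("Q:", query)
    print("A:", llm.invoke(query))
```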
If the model is served by a standalone vLLM server exposing the OpenAI-compatible API, the LLM object must be created with `VLLMOpenAI` instead of `VLLM`. Instead of:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-v0.3",
    ...
    vllm_kwargs={
        ...
    },
)
```
use:

```python
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-v0.3",
    model_kwargs={
        ...
    },
)
```
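
The `VLLMOpenAI` client above assumes a vLLM server is already exposing the OpenAI-compatible API on `http://localhost:8000/v1`. One common way to start such a server (a sketch; the exact command and flags depend on the installed vLLM version, and newer releases also accept `vllm serve <model>`) is:

```bash
# Start a standalone vLLM server exposing the OpenAI-compatible API on port 8000.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-v0.3 \
    --max-model-len 16384
```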

from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.readers.web import SitemapReader
from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import VLLM, VLLMOpenAI


if __name__ == '__main__':

    loader = SitemapReader(html_to_text=True)
    # Reads pages from the web based on their sitemap.xml.
    # Other data connectors are available.

    documents = loader.load_data(
        sitemap_url='https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frpj5kvxryk1/b/thisIsThePlace/o/latest.xml'
    )

    # In-memory instance of Qdrant (a local Docker-based deployment works as well)
    client = QdrantClient(
        location=":memory:"
    )

    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2"
    )

    # Local instance of Mistral 7B Instruct v0.3 using the vLLM inference server
    # and the FlashAttention backend for performance. The model is downloaded
    # from HuggingFace (no account needed).
    llm = VLLM(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        gpu_memory_utilization=0.95,
        tensor_parallel_size=1,  # inference distributed over X GPUs
        trust_remote_code=True,  # mandatory for HF models
        max_new_tokens=128,
        top_k=10,
        top_p=0.95,
        temperature=0.8,
        vllm_kwargs={
            "tokenizer_mode": "mistral",
            "swap_space": 1,
            "gpu_memory_utilization": 0.95,
            "max_model_len": 16384,  # limitation due to insufficient RAM
            "enforce_eager": False,
        },
    )

    system_prompt = (
        "As a support engineer, your role is to leverage the information "
        "in the context provided. Your task is to respond to queries based strictly "
        "on the information available in the provided context. Do not create new "
        "information under any circumstances. Refrain from repeating yourself. "
        "Extract your response solely from the context mentioned above. "
        "If the context does not contain relevant information for the question, "
        "respond with 'How can I assist you with questions related to the document?'"
    )

    Settings.llm = llm
    Settings.embed_model = embeddings
    Settings.chunk_size = 1000
    Settings.chunk_overlap = 100
    Settings.num_output = 256
    Settings.system_prompt = system_prompt

    vector_store = QdrantVectorStore(
        client=client,
        collection_name="ansh"
    )

    storage_context = StorageContext.from_defaults(
        vector_store=vector_store
    )

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context
    )

    query_engine = index.as_query_engine(llm=llm)

    response = query_engine.query(
        'What are the document formats supported by the Vision service?'
    )

    print("Response: ", response.response.strip())
    for key in response.metadata.keys():
        print("Source: ", response.metadata[key]['Source'])