# Overview

This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai/files). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.2 on a vLLM inference server powered by an NVIDIA A10 GPU.

# Requirements

* An OCI tenancy with A10 GPU quota.

# Libraries

* **LlamaIndex**: a data framework for LLM-based applications that benefit from context augmentation.
* **LangChain**: a framework for developing applications powered by large language models.
* **vLLM**: a fast and easy-to-use library for LLM inference and serving.
* **Qdrant**: a vector similarity search engine.

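Qdrant's core operation, vector similarity search, ranks stored embedding vectors by their similarity to a query vector. A toy in-memory illustration in plain Python (not Qdrant's actual implementation, just the principle):

```
# Rank stored vectors by cosine similarity to a query vector.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# A tiny "vector store": document id -> embedding vector
store = {"doc1": [1.0, 0.0, 0.0], "doc2": [0.7, 0.7, 0.0], "doc3": [0.0, 1.0, 0.0]}
query = [1.0, 0.1, 0.0]

# Most similar documents first
ranked = sorted(store, key=lambda k: cosine(store[k], query), reverse=True)
print(ranked)  # doc1 is closest to the query
```

A real engine like Qdrant does the same ranking with approximate nearest-neighbor indexing so it scales to millions of vectors.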
# Mistral LLM

[Mistral.ai](https://mistral.ai/) is a French AI startup that develops Large Language Models (LLMs). Mistral 7B Instruct is a small yet powerful open model that supports English and code. The Instruct version is optimized for chat. In this example, inference performance is increased using the [FlashAttention](https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention) backend.

# Instance Configuration

In this example a single A10 GPU VM shape, codename VM.GPU.A10.1, is used. This is currently the smallest GPU shape available on OCI. With this configuration, it is necessary to limit the vLLM model context length to 16384 because the GPU memory is insufficient for the full context length. To use the full context length, a dual A10 GPU shape, codename VM.GPU.A10.2, is necessary.
The image is the NVIDIA GPU Cloud Machine image from the OCI marketplace.
A boot volume of 200 GB is also recommended.

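When vLLM is run as a standalone server (see the Remote vLLM server section below), the context length is capped with the `--max-model-len` option. A sketch of the launch command, assuming the Instruct v0.2 model from the overview:

```
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --max-model-len 16384
```

On a VM.GPU.A10.2 shape the option can be dropped to use the model's full context length.
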
# Image Update

To keep up with the library versions used here, it is highly recommended to update the NVIDIA drivers and CUDA by running:

```
sudo apt purge nvidia* libnvidia*
sudo apt-get install -y cuda-drivers-545
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-toolkit-12-3
sudo reboot
```

# Framework Deployment

## Install packages

First set up a virtual environment, activate it, and install the required packages:

```
python -m venv rag
source rag/bin/activate
pip install -r requirements.txt
```

## Deploy the framework

The Python script creates an all-in-one framework with local instances of the Qdrant vector similarity search engine and the vLLM inference server. Alternatively, these two components can be deployed remotely using Docker containers. This option can be useful in two situations:
* The engines are shared by multiple solutions whose data must be segregated.
* The engines are deployed on instances with optimized configurations (GPU, RAM, CPU cores, etc.).

### Framework components

* **SitemapReader**: asynchronous sitemap reader for the web. Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example the sitemap.xml file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **SentenceTransformerEmbeddings**: sentence embeddings model object (from Hugging Face). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: fast and easy-to-use LLM inference server.
* **Settings**: bundle of commonly used resources used during the indexing and querying stages of a LlamaIndex pipeline/application. In this example a global configuration is used.
* **QdrantVectorStore**: vector store where embeddings and documents are stored within a Qdrant collection.
* **StorageContext**: utility container for storing nodes, indices, and vectors.
* **VectorStoreIndex**: index built from the documents loaded in the vector store.

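Wired together, the components above form the ingestion-and-query pipeline. A condensed sketch, not the repository's actual script: the package paths follow recent LlamaIndex/LangChain layouts, and the sitemap URL, collection name, and embedding model are placeholder assumptions:

```
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.readers.web import SitemapReader
from llama_index.vector_stores.qdrant import QdrantVectorStore
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.llms import VLLM
from qdrant_client import QdrantClient

# 1. Read pages listed in the sitemap.xml stored in an OCI bucket (placeholder URL)
documents = SitemapReader().load_data(sitemap_url="https://<bucket-url>/sitemap.xml")

# 2. Local, in-memory Qdrant instance backing the vector store
client = QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="rag")

# 3. Global configuration: embedding model and local vLLM-backed LLM
Settings.embed_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
Settings.llm = VLLM(model="mistralai/Mistral-7B-Instruct-v0.2",
                    vllm_kwargs={"max_model_len": 16384})

# 4. Index the documents and query them
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print(index.as_query_engine().query("What is this site about?"))
```

The remote variants below swap out only steps 2 and 3; the rest of the pipeline is unchanged.
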
### Remote Qdrant client

Instead of:
```
client = QdrantClient(location=":memory:")
```
Use:
```
client = QdrantClient(host="localhost", port=6333)
```

To deploy the container:
```
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```

### Remote vLLM server

Instead of:
```
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    ...
    vllm_kwargs={
        ...
    },
)
```
Use:
```
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    model_kwargs={
        ...
    },
)
```
To deploy the container, refer to this [tutorial](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/vllm-mistral).
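
Once the server is up, its OpenAI-compatible endpoint can be smoke-tested with a plain HTTP call before wiring it into the framework. An illustrative request, assuming the server runs locally on the default port with the Instruct v0.2 model:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
         "prompt": "What is RAG?",
         "max_tokens": 32}'
```

A JSON response with a `choices` array indicates the server is ready to accept requests from `VLLMOpenAI`.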

# Notes

The libraries used in this example evolve quickly. The Python script provided here may have to be updated in the near future to avoid warnings and errors.

# Documentation

* [LlamaIndex](https://docs.llamaindex.ai/en/stable/)
* [LangChain](https://python.langchain.com/docs/get_started/introduction)
* [vLLM](https://docs.vllm.ai/en/latest/)
* [Qdrant](https://qdrant.tech/documentation/)