Commit 599a31b

Merge branch 'main' into conda-envs
2 parents 294d216 + d5561ab commit 599a31b

8 files changed: +413 -8 lines changed

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# Overview

This repository is a variant of the Retrieval Augmented Generation (RAG) tutorial available [here](https://github.com/oracle-devrel/technology-engineering/tree/main/ai-and-app-modernisation/ai-services/generative-ai-service/rag-genai/files). Instead of the OCI GenAI Service, it uses a local deployment of Mistral 7B Instruct v0.2 served by a vLLM inference server running on an NVIDIA A10 GPU.

# Requirements

* An OCI tenancy with A10 GPU quota.

# Libraries

* **LlamaIndex**: a data framework for LLM-based applications which benefit from context augmentation.
* **LangChain**: a framework for developing applications powered by large language models.
* **vLLM**: a fast and easy-to-use library for LLM inference and serving.
* **Qdrant**: a vector similarity search engine.

# Mistral LLM

[Mistral.ai](https://mistral.ai/) is a French AI startup that develops Large Language Models (LLMs). Mistral 7B Instruct is a small yet powerful open model that supports English and code. The Instruct version is optimized for chat. In this example, inference performance is improved by using the [FlashAttention](https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention) backend.

# Instance Configuration

In this example, a single A10 GPU VM shape, VM.GPU.A10.1, is used. This is currently the smallest GPU shape available on OCI. With this configuration, the vLLM model context length option must be limited to 16384 because the available memory is insufficient for the full context length. To use the full context length, a dual A10 GPU shape, VM.GPU.A10.2, is necessary.
The image is the NVIDIA GPU Cloud Machine image from the OCI marketplace.
A boot volume of 200 GB is also recommended.

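To give a feel for why the context length affects memory, the following sketch estimates the size of the key/value cache for a single sequence. The architecture values (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are assumptions taken to match the published Mistral 7B configuration, and `kv_cache_bytes` is a hypothetical helper, not part of vLLM. The estimate covers only the KV cache; model weights, activations, and runtime overhead also compete for GPU memory, so this number alone does not decide whether a given context length fits.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Rough KV-cache size in bytes for one sequence of `num_tokens` tokens.

    Default values are assumptions for Mistral 7B in 16-bit precision.
    """
    # Factor 2: one key tensor and one value tensor are stored per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    # 16384 is the limit used in this example; 32768 is assumed to be
    # the full context length of the model.
    for context_length in (16384, 32768):
        size = kv_cache_bytes(context_length) / GIB
        print(f"{context_length} tokens: {size:.1f} GiB")
```

Under these assumptions, halving the context length halves the memory reserved for one full-length sequence, which is the effect the `max_model_len` limit relies on.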
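# Image Update

To ensure compatibility with recent library versions, it is highly recommended to update the NVIDIA drivers and CUDA by running:

```
# Remove the existing NVIDIA driver packages (patterns are quoted so the shell does not expand them)
sudo apt purge "nvidia*" "libnvidia*"
# Install the 545 driver series, the open kernel modules, and CUDA 12.3
sudo apt-get install -y cuda-drivers-545
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-toolkit-12-3
# Reboot to load the new driver
sudo reboot
```
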
# Framework deployment

## Install packages

First, set up a virtual environment and install all the required packages into it. The environment must be activated before running `pip`, otherwise the packages are installed outside of it:

```
python -m venv rag
source rag/bin/activate
pip install -r requirements.txt
```

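A quick way to confirm that the interpreter is running inside the virtual environment before installing or running anything is sketched below. It relies only on the standard library; `in_virtualenv` is a hypothetical helper name introduced for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this reports `False` after `source rag/bin/activate`, the shell is probably still using the system interpreter.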
## Deploy the framework

The Python script creates an all-in-one framework with local instances of the Qdrant vector similarity search engine and the vLLM inference server. Alternatively, these two components can be deployed remotely using Docker containers. This option can be useful in two situations:
* The engines are shared by multiple solutions for which data must be segregated.
* The engines are deployed on instances with optimized configurations (GPU, RAM, CPU cores, etc.).

### Framework components

* **SitemapReader**: Asynchronous sitemap reader for the web. Reads pages from the web based on their sitemap.xml. Other data connectors are available (Snowflake, Twitter, Wikipedia, etc.). In this example, the sitemap XML file is stored in an OCI bucket.
* **QdrantClient**: Python client for the Qdrant vector search engine.
* **SentenceTransformerEmbeddings**: Sentence embeddings model object (from HuggingFace). Other options include Aleph Alpha, Cohere, MistralAI, SpaCy, etc.
* **VLLM**: Fast and easy-to-use LLM inference server.
* **Settings**: Bundle of commonly used resources for the indexing and querying stages of a LlamaIndex pipeline/application. In this example, the global configuration is used.
* **QdrantVectorStore**: Vector store where embeddings and docs are stored within a Qdrant collection.
* **StorageContext**: Utility container for storing nodes, indices, and vectors.
* **VectorStoreIndex**: Index built from the documents loaded in the Vector Store.

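The accompanying script sets `chunk_size=1000` and `chunk_overlap=100` through **Settings**. The sketch below illustrates what those two values mean with a plain sliding window over a list of tokens. It is a simplified illustration only: `chunk_tokens` is a hypothetical helper, and the splitter that LlamaIndex actually applies is more elaborate (for example, it takes sentence boundaries and a real tokenizer into account).

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=100):
    """Split `tokens` into windows of `chunk_size` sharing `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    words = [f"w{i}" for i in range(2500)]  # stand-in for a tokenized page
    chunks = chunk_tokens(words)
    print([len(chunk) for chunk in chunks])  # [1000, 1000, 700]
```

The overlap means that the last 100 tokens of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.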
### Remote Qdrant client

Instead of:
```
client = QdrantClient(location=":memory:")
```
Use:
```
client = QdrantClient(host="localhost", port=6333)
```

To deploy the container:
```
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```

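One possible way to switch between the two modes without editing the script is to derive the client arguments from the environment. The sketch below only builds the keyword arguments; the variable names `QDRANT_HOST` and `QDRANT_PORT` and the helper `qdrant_client_kwargs` are assumptions introduced for illustration, not part of the original script.

```python
import os


def qdrant_client_kwargs(env=None):
    """Return keyword arguments for QdrantClient based on environment variables."""
    env = os.environ if env is None else env
    host = env.get("QDRANT_HOST")  # hypothetical variable name
    if not host:
        # No remote host configured: fall back to the in-memory instance.
        return {"location": ":memory:"}
    return {"host": host, "port": int(env.get("QDRANT_PORT", "6333"))}


if __name__ == "__main__":
    print(qdrant_client_kwargs({}))
    print(qdrant_client_kwargs({"QDRANT_HOST": "localhost"}))
```

In the script, the result would be passed on as `client = QdrantClient(**qdrant_client_kwargs())`.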
### Remote vLLM server

Instead of:
```
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    ...
    vllm_kwargs={
        ...
    },
)
```
Use:
```
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    model_kwargs={
        ...
    },
)
```
To deploy the container, refer to this [tutorial](https://github.com/oracle-devrel/technology-engineering/tree/main/cloud-infrastructure/ai-infra-gpu/GPU/vllm-mistral).

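Before wiring the remote server into the framework, it can help to check that it answers on its OpenAI-compatible completions endpoint. The sketch below builds such a request with the standard library only and does not send it. The helper `build_completion_request` is hypothetical, and the base URL and model name are assumptions that must match how the server was actually started.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             base_url="http://localhost:8000/v1",
                             model="mistralai/Mistral-7B-Instruct-v0.2",
                             max_tokens=128):
    """Build (but do not send) a POST request for the /completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring openai_api_key="EMPTY" above.
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_completion_request("What is Retrieval Augmented Generation?")
    print(request.get_method(), request.full_url)
    # To actually send it once the server is running:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

Keeping the request construction separate from sending makes it possible to inspect the payload even when no server is available.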
# Notes

The libraries used in this example are evolving quite fast. The Python script provided here might have to be updated in the near future to avoid warnings and errors.

# Documentation

* [LlamaIndex](https://docs.llamaindex.ai/en/stable/)
* [LangChain](https://python.langchain.com/docs/get_started/introduction)
* [vLLM](https://docs.vllm.ai/en/latest/)
* [Qdrant](https://qdrant.tech/documentation/)
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
from llama_index.core import VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.readers.web import SitemapReader
from qdrant_client import QdrantClient
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.llms import VLLM, VLLMOpenAI


# Reads pages from the web based on their sitemap.xml.
# Other data connectors are available.
loader = SitemapReader(html_to_text=True)

documents = loader.load_data(
    sitemap_url='https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frpj5kvxryk1/b/thisIsThePlace/o/latest.xml'
)
# List the source URL of every page that was loaded.
for document in documents:
    print(document.metadata['Source'])

# Local in-memory instance of Qdrant (no Docker container required)
client = QdrantClient(
    location=":memory:"
)

# Sentence embeddings model from HuggingFace
embeddings = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Local instance of Mistral 7B Instruct v0.2 using the vLLM inference server
# and the FlashAttention backend for performance. The model is downloaded
# from HuggingFace (no account needed).
llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=1,  # inference distributed over X GPUs
    trust_remote_code=True,  # mandatory for hf model
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
    # Engine options forwarded to vLLM; gpu_memory_utilization is set here only.
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.95,
        "max_model_len": 16384,  # limited because memory is insufficient
        "enforce_eager": True,
    },
)

system_prompt = (
    "As a support engineer, your role is to leverage the information "
    "in the context provided. Your task is to respond to queries based strictly "
    "on the information available in the provided context. Do not create new "
    "information under any circumstances. Refrain from repeating yourself. "
    "Extract your response solely from the context mentioned above. "
    "If the context does not contain relevant information for the question, "
    "respond with 'How can I assist you with questions related to the document?'"
)

# Global LlamaIndex configuration
Settings.llm = llm
Settings.embed_model = embeddings
Settings.chunk_size = 1000
Settings.chunk_overlap = 100
Settings.num_output = 256
Settings.system_prompt = system_prompt

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ansh"
)

storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

query_engine = index.as_query_engine(llm=llm)

response = query_engine.query(
    'What are the document formats supported by the Vision service?'
)

print("Response: ", response.response.strip())
for key in response.metadata.keys():
    print("Source: ", response.metadata[key]['Source'])
