Commit c7dfa58

Merge branch 'main' into cris_branch
2 parents: 821afb1 + 014c8c4

File tree

1 file changed: 125 additions, 0 deletions
  • cloud-infrastructure/ai-infra-gpu/GPU/vllm-mistral

# Overview

This repository provides a step-by-step tutorial for deploying and using the [Mistral 7B Instruct](https://mistral.ai/technology/#models) Large Language Model with the [vLLM](https://github.com/vllm-project/vllm?tab=readme-ov-file) library.

# Requirements

* An OCI tenancy with A10 GPU quota.
* A [Hugging Face](https://huggingface.co/) account with a valid Auth Token.

# Model Deployment

## Mistral Models

[Mistral.ai](https://mistral.ai/) is a French AI startup that develops Large Language Models (LLMs). Mistral 7B is a small yet powerful open model that supports English and code. Mistral 7B Instruct is a chat-optimized version of Mistral 7B. Mixtral 8x7B is a sparse Mixture-of-Experts model, stronger than Mistral 7B, that supports French, Italian, German and Spanish on top of English and code. It uses only about 12B of its 45B total parameters during inference.

## vLLM Library

vLLM is a model serving alternative to NVIDIA Triton. It is easy to use because it ships as a preconfigured container.

## Instance Configuration

In this example, a single-A10 GPU VM shape (VM.GPU.A10.1) is used. The image is the NVIDIA GPU Cloud Machine image from the OCI marketplace. A boot volume of 200 GB is also recommended.
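
As a sketch, the instance can also be launched from the OCI CLI, assuming it is installed and configured; the availability domain and all OCIDs below are placeholders to replace with your own values:

```
oci compute instance launch \
    --display-name vllm-mistral \
    --shape VM.GPU.A10.1 \
    --boot-volume-size-in-gbs 200 \
    --availability-domain <availability_domain> \
    --compartment-id ocid1.compartment.oc1..<placeholder> \
    --image-id ocid1.image.oc1..<placeholder> \
    --subnet-id ocid1.subnet.oc1..<placeholder> \
    --ssh-authorized-keys-file ~/.ssh/id_rsa.pub
```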

## Image Update

Since the latest NVIDIA GPU Cloud Machine image is almost one year old, it is recommended to update the NVIDIA drivers and CUDA by running:

```
# Remove the preinstalled NVIDIA packages (quoting prevents shell glob expansion)
sudo apt purge 'nvidia*' 'libnvidia*'
# Install a recent driver, the open kernel modules and the CUDA toolkit
sudo apt-get install -y cuda-drivers-545
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-toolkit-12-3
sudo reboot
```
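
After the reboot, you can check that the new driver and CUDA versions are active:

```
nvidia-smi
```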

## System Configuration

Once the NVIDIA packages are updated, it is necessary to reconfigure Docker to make it GPU-aware:

```
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
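
Optionally, you can verify that containers now have GPU access, for example by running `nvidia-smi` from a stock Ubuntu container:

```
sudo docker run --rm --gpus all ubuntu nvidia-smi
```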

## Container Deployment

To deploy the model, simply run the vLLM container:

```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2
```

where `$HF_TOKEN` is the Hugging Face Auth Token set as an environment variable. Pulling the container image may take up to 20 minutes.
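
The token can be set, for example, by exporting it in the current shell before launching the container (the value below is a placeholder):

```
# Placeholder: replace with your own Hugging Face Auth Token
export HF_TOKEN=<your_hf_token>
```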

Once the deployment is finished, the model is available by default at http://0.0.0.0:8000.
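
The model weights are downloaded on first start, which takes a while; you can follow the server logs until it reports that it is listening (the `--filter` value assumes the image name used above):

```
docker logs -f $(docker ps -q --filter ancestor=ghcr.io/mistralai/mistral-src/vllm:latest)
```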

# Model Calling

The Mistral model is available through an OpenAI-compatible API. As a prerequisite, you must have the curl package installed:

```
sudo apt-get install -y curl
```

Below are three examples of curl requests. The `json_pp` utility (JSON Pretty Printer) makes the model output easier to read by printing the JSON data in a legible, indented format.

* Check the model version available in the container:

```
curl http://localhost:8000/v1/models | json_pp
```

* Complete a sentence:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "A GPU is a",
        "max_tokens": 128,
        "temperature": 0.7
    }' | json_pp
```

* Chat with the model:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [
            {"role": "user", "content": "Which GPU models are available on Oracle Cloud Infrastructure?"}
        ]
    }' | json_pp
```

# Notes

Mixtral 8x7B is much more memory-hungry than Mistral 7B: in half precision its 45B parameters alone take roughly 90 GB of GPU memory, more than a single A10 GPU VM (24 GB) offers and, once the KV cache is added, too much for a quad A10 GPU BM (4 x 24 GB). Therefore it is necessary to either:
* Increase the size of the shape to a BM.GPU4.8 (8 x A100 40 GB GPUs).
* Use a quantized version such as [TheBloke/mixtral-8x7b-v0.1-AWQ](https://huggingface.co/TheBloke/mixtral-8x7b-v0.1-AWQ). However, AWQ quantization on vLLM is not fully optimized yet, so speed might be lower than with the original model.

For example, the quantized model can be served on a quad-A10 shape by sharding it across all four GPUs (`--tensor-parallel-size 4`):

```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model TheBloke/mixtral-8x7b-v0.1-AWQ \
    --quantization awq \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager
```
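
Once this container is up, you can check that the model is indeed sharded across the four GPUs, each of which should report substantial memory usage:

```
nvidia-smi --query-gpu=index,memory.used --format=csv
```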

# Resources

* [vLLM Documentation](https://docs.vllm.ai/en/latest/)
* [Mistral Documentation](https://docs.mistral.ai/)
