
Commit 23e58ee

Merge pull request #1049 from oracle-devrel/triton-mixtral-8x7b
Triton mixtral 8x7b
2 parents 885a0c2 + beffed8 commit 23e58ee

File tree: 6 files changed, +1644 -0 lines changed

Lines changed: 258 additions & 0 deletions

# Overview

This repository provides a step-by-step tutorial for deploying and using the Mixtral 8x7B Large Language Model with the NVIDIA Triton Inference Server and the TensorRT-LLM backend.

# Requirements

* An OCI tenancy with GPU4 (A100 40GB) quota
* A Huggingface account with a valid Auth Token

# Model Deployment

## Instance Configuration

In this example a BM.GPU4.8 instance is used. The image is the Oracle Linux Gen2 GPU image. A boot volume of 1000 GB is recommended (running `sudo /usr/libexec/oci-growfs -y` might be necessary to expand the filesystem). Alternatively, one of the local NVMe drives can be mounted.
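
If you choose a local NVMe drive instead of relying on the boot volume, here is a minimal sketch for formatting and mounting one (the device name `/dev/nvme0n1` and the mount point are assumptions; check the `lsblk` output on your instance first):
```
# List block devices to identify the local NVMe drives (names are shape-dependent)
lsblk
# Format and mount one of them -- adjust the device name to match your output
sudo mkfs.xfs /dev/nvme0n1
sudo mkdir -p /mnt/nvme0
sudo mount /dev/nvme0n1 /mnt/nvme0
```
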
## Package Install

### Install and configure Docker

Enable all the required repositories. To do this you are going to need the yum-utils package.
```
sudo dnf install -y dnf-utils zip unzip
sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
```
Install Docker.
```
sudo dnf remove -y runc
sudo dnf install -y docker-ce --nobest
```
Enable and start the Docker service.
```
sudo systemctl enable docker.service
sudo systemctl start docker.service
```

### Install and configure NVIDIA Container Toolkit

Configure the production repository.
```
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```
Optionally, configure the repository to use experimental packages.
```
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
```
Install the NVIDIA Container Toolkit packages.
```
sudo yum install -y nvidia-container-toolkit
```
Configure the container runtime by using the nvidia-ctk command.
```
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
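
Optionally, verify that containers can access the GPUs before moving on; a quick check (the CUDA image tag is an assumption, any recent CUDA base image works):
```
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
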
## Build the tensorrtllm_backend container

Clone the repository.
```
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
```
Install git-lfs.
```
sudo yum install git git-lfs -y
```
Go to the directory and update the submodules.
```
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
```
Build the backend container using the Dockerfile.
```
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```

## Build the engines

Start the container.
```
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/opc/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```
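
The build commands below assume the Mixtral-8x7B-v0.1 weights are already present under `tensorrt_llm/examples/mixtral`. If you still need to fetch them, here is a minimal sketch from inside the container (the target directory is an assumption; use the Huggingface Auth Token from the requirements if prompted for credentials):
```
cd /tensorrtllm_backend/tensorrt_llm/examples/mixtral
git lfs install
# Download the model weights from Huggingface (the checkpoint is large, so make sure the volume has enough free space)
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
```
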
Then build the model engines with tensor parallelism (splitting the model so that it fits across the 8 A100 GPUs).
```
python ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
                                      --output_dir ./tllm_checkpoint_mixtral_8gpu \
                                      --dtype float16 \
                                      --tp_size 8
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_8gpu \
             --output_dir ./trt_engines/mixtral/tp8 \
             --gemm_plugin float16
```
The engine files are located in the `./trt_engines/mixtral/tp8` folder.
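
Optionally, you can smoke-test the engines before wiring up Triton with the `run.py` script that ships with the TensorRT-LLM examples; a minimal sketch, assuming the working directory is still `examples/mixtral` and the tokenizer sits alongside the downloaded weights:
```
# One MPI rank per GPU for the TP=8 engines
mpirun -n 8 --allow-run-as-root \
    python3 ../run.py --engine_dir ./trt_engines/mixtral/tp8 \
                      --tokenizer_dir ./Mixtral-8x7B-v0.1 \
                      --max_output_len 64 \
                      --input_text "What is cloud computing?"
```
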
## Prepare the model repository

Create the model repository that will be used by the Triton Inference Server.
```
cd tensorrtllm_backend
mkdir triton_model_repo
```
Copy the example models to the model repository.
```
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
```
Copy the engines to the model repository.
```
cp tensorrt_llm/examples/mixtral/trt_engines/mixtral/tp8/* triton_model_repo/tensorrt_llm/1
```
It is now time to modify the config.pbtxt files. Following the guidelines from the [official repo](https://github.com/triton-inference-server/tensorrtllm_backend), here are the sections to be modified:

* tensorrtllm_backend/triton_model_repo/ensemble/config.pbtxt

```
max_batch_size: 1
```

* tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt

```
max_batch_size: 1
...
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-Instruct-v0.1"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```

* tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt

```
max_batch_size: 1
...
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-Instruct-v0.1"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```

* tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt

```
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}

dynamic_batching {
  preferred_batch_size: [ 1 ]
  max_queue_delay_microseconds: 100
}
...
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
...
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}
...
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
```

* tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/config.pbtxt

```
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}
...
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```
Example files are provided in this repo.
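
After these edits, the model repository should look roughly like this (a sketch; only the pieces touched above are shown):
```
triton_model_repo/
├── ensemble/config.pbtxt
├── preprocessing/config.pbtxt
├── postprocessing/config.pbtxt
├── tensorrt_llm/
│   ├── config.pbtxt
│   └── 1/              <- engine files copied here
└── tensorrt_llm_bls/config.pbtxt
```
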
## Run the inference server

Once all the files are ready, start the container that was built previously:
```
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/opc/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```
and, from within the container, start the server by running the following Python command:
```
python3 scripts/launch_triton_server.py --world_size=8 --model_repo=/tensorrtllm_backend/triton_model_repo
```
where `--world_size` is the number of GPUs you want to use for serving.
If the deployment is successful, you should see output similar to:
```
I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
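
You can also confirm readiness from another shell on the host; a quick check against the standard Triton health endpoint on port 8000 (as reported in the log above):
```
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```
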
## Test the model

To test the model, one can query the server endpoint, for example with:
```
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is cloud computing?", "max_tokens": 512, "bad_words": "", "stop_words": ""}'
```
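
The response is a JSON document; in this example ensemble the generated text is returned in the `text_output` field. A minimal sketch for extracting just that field (assumes `jq` is installed on the host):
```
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is cloud computing?", "max_tokens": 512, "bad_words": "", "stop_words": ""}' \
     | jq -r '.text_output'
```
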
# Resources

* [TensorRT-LLM Backend Documentation](https://github.com/triton-inference-server/tensorrtllm_backend)
* [Mistral Documentation](https://docs.mistral.ai/)
