# Overview

This repository provides a step-by-step tutorial for deploying and using the Mixtral 8x7B Large Language Model with the NVIDIA Triton Inference Server and the TensorRT-LLM backend.

# Requirements

* An OCI tenancy with GPU4 (A100 40GB) quota
* A Hugging Face account with a valid Auth Token

# Model Deployment

## Instance Configuration

In this example a BM.GPU4.8 instance is used. The image is the Oracle Linux Gen2 GPU image. A boot volume of 1000 GB is recommended (running `sudo /usr/libexec/oci-growfs -y` might be necessary to expand the filesystem to the full volume size). Alternatively, one of the local NVMe drives can be mounted.
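
If you choose the local NVMe storage instead of a large boot volume, a minimal sketch for formatting and mounting the first NVMe device could look like the following (the device name `/dev/nvme0n1` and the mount point `/mnt/nvme0` are assumptions; check `lsblk` on your instance):
```
# List block devices to identify the local NVMe drives (names vary by shape)
lsblk
# Create a filesystem on the first NVMe device and mount it (assumed device and mount point)
sudo mkfs.xfs /dev/nvme0n1
sudo mkdir -p /mnt/nvme0
sudo mount /dev/nvme0n1 /mnt/nvme0
```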

## Package Install

### Install and configure Docker

Enable all the required repositories. To do this you will need the yum-utils package.
```
sudo dnf install -y dnf-utils zip unzip
sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
```
Install Docker.
```
sudo dnf remove -y runc
sudo dnf install -y docker-ce --nobest
```
Enable and start the Docker service.
```
sudo systemctl enable docker.service
sudo systemctl start docker.service
```
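Optionally, verify that the Docker daemon is up before moving on. This quick check is not strictly required by the rest of the tutorial:
```
sudo systemctl status docker.service --no-pager
sudo docker run --rm hello-world
```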

### Install and configure NVIDIA Container Toolkit

Configure the production repository.
```
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```
Optionally, configure the repository to use experimental packages.
```
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
```
Install the NVIDIA Container Toolkit packages.
```
sudo yum install -y nvidia-container-toolkit
```
Configure the container runtime by using the nvidia-ctk command.
```
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
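It is worth verifying at this point that containers can access the GPUs. A quick sanity check (any small image works; `ubuntu` is pulled from Docker Hub):
```
sudo docker run --rm --gpus all ubuntu nvidia-smi
```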

## Build the tensorrtllm_backend container

Clone the repository.
```
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
```
Install git-lfs.
```
sudo yum install -y git git-lfs
```
Go to the directory and update the submodules.
```
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
```
Build the backend container using the provided Dockerfile.
```
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```
## Build the engines

Start the container.
```
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/opc/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```
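The build commands below assume the working directory is the Mixtral example folder inside the container and that the Hugging Face checkpoint is already available locally. A minimal sketch of that preparation step, assuming the base `Mixtral-8x7B-v0.1` checkpoint (adapt the repository name if you use the Instruct variant referenced later in the tokenizer configuration) and Hugging Face credentials configured for git:
```
# Work from the Mixtral example folder so the relative paths below resolve
cd /tensorrtllm_backend/tensorrt_llm/examples/mixtral
# Download the model weights from Hugging Face (requires git-lfs and, for gated repos, a valid token)
git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
```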
Then build the model engines with tensor parallelism, splitting the model across the 8 A100 GPUs so that it fits in GPU memory.
```
python ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
                                      --output_dir ./tllm_checkpoint_mixtral_8gpu \
                                      --dtype float16 \
                                      --tp_size 8
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_8gpu \
             --output_dir ./trt_engines/mixtral/tp8 \
             --gemm_plugin float16
```
The engine files are located in the `./trt_engines/mixtral/tp8` folder.
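
Optionally, the engines can be sanity-checked with the example runner shipped in the TensorRT-LLM `examples` folder before wiring them into Triton. The script path and flags below are assumptions based on the upstream examples; verify them against your checkout:
```
mpirun -n 8 --allow-run-as-root \
    python3 ../run.py --engine_dir ./trt_engines/mixtral/tp8 \
                      --tokenizer_dir ./Mixtral-8x7B-v0.1 \
                      --max_output_len 64 \
                      --input_text "What is cloud computing?"
```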

## Prepare the model repository

Create the model repository that will be used by the Triton Inference Server.
```
cd tensorrtllm_backend
mkdir triton_model_repo
```
Copy the example models to the model repository.
```
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
```
Copy the engines to the model repository.
```
cp tensorrt_llm/examples/mixtral/trt_engines/mixtral/tp8/* triton_model_repo/tensorrt_llm/1
```
It is now time to modify the config.pbtxt files. Following the guidelines from the [official repo](https://github.com/triton-inference-server/tensorrtllm_backend), here are the sections to be modified (a scripted alternative using the repository's template-filling helper is sketched after the list):

* tensorrtllm_backend/triton_model_repo/ensemble/config.pbtxt

```
max_batch_size: 1
```

* tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt

```
max_batch_size: 1
...
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-Instruct-v0.1"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```

* tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt

```
max_batch_size: 1
...
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-Instruct-v0.1"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```

* tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt

```
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}

dynamic_batching {
  preferred_batch_size: [ 1 ]
  max_queue_delay_microseconds: 100
}
...
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
...
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}
...
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
```

* tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/config.pbtxt

```
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}
...
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```
Example config files are provided in this repository.
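
Instead of editing each file by hand, the upstream repository also provides a `tools/fill_template.py` helper that substitutes placeholders in the template configs. A sketch of its use for the preprocessing model is shown below; the placeholder names (`tokenizer_dir`, `triton_max_batch_size`, `preprocessing_instance_count`) are assumptions that vary between backend versions, so check the template files in your checkout before relying on them:
```
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-Instruct-v0.1,triton_max_batch_size:1,preprocessing_instance_count:1
```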

## Run the inference server

Once all the files are ready, start the container that was built previously:
```
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/opc/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
```
and from within the container start the server by running the following Python command:
```
python3 scripts/launch_triton_server.py --world_size=8 --model_repo=/tensorrtllm_backend/triton_model_repo
```
where `--world_size` is the number of GPUs you want to use for serving.
If the deployment is successful, you should see output similar to:
```
I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
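Before sending inference requests, you can confirm the server is ready through Triton's standard health endpoint (run from the host or from inside the container):
```
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```
A `200` response code indicates the server and its models are ready to serve requests.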
## Test the model

To test the model, one can query the server's generate endpoint, for example with:
```
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is cloud computing?", "max_tokens": 512, "bad_words": "", "stop_words": ""}'
```
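Additional sampling parameters can be passed in the same JSON payload. The field names below (`temperature`, `top_k`, `top_p`) are assumptions based on the inputs exposed by the upstream ensemble model; check `triton_model_repo/ensemble/config.pbtxt` for the exact names accepted by your version:
```
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is cloud computing?", "max_tokens": 256, "bad_words": "", "stop_words": "", "temperature": 0.7, "top_k": 50, "top_p": 0.9}'
```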

# Resources

* [TensorRT-LLM Backend Documentation](https://github.com/triton-inference-server/tensorrtllm_backend)
* [Mistral Documentation](https://docs.mistral.ai/)