# TensorRT-LLM Briton with Qwen/Qwen2-57B-A14B-MoE-int4

This is a deployment for TensorRT-LLM Briton with Qwen/Qwen2-57B-A14B-MoE-int4. Briton is Baseten's solution for production-grade deployments of causal language models (e.g. Llama, Qwen, Mistral) via TensorRT-LLM.

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and in-flight batching
- *Distributed inference*: run large models (such as Llama-405B) with tensor parallelism
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long-prompt workloads

Optionally, you can also enable:
- *Speculative decoding* using an external draft model or self-speculative decoding
- *fp8 quantization* for deployments on H100, H200, and L4 GPUs (see the sketch below)

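As a rough sketch of the fp8 option, the `quantization_type` field in the `trt_llm.build` section of `config.yaml` (shown in full further below) would change along these lines. The accelerator choice and field values here are assumptions based on the int4 config below, not a verified recipe:

```yaml
# Hypothetical sketch: switching from int4 weight-only quantization to fp8.
# fp8 requires H100, H200, or L4 GPUs; it is not supported on A100.
resources:
  accelerator: H100   # assumption: a single H100 suffices for this sketch
trt_llm:
  build:
    quantization_type: fp8   # instead of weights_int4; remove for float16/bfloat16
```
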
## Examples

This deployment is specifically designed for the Hugging Face model [Qwen/Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in their architecture name. We currently support Llama, Qwen, and Mistral models, among others.

Qwen/Qwen2-57B-A14B-Instruct is a text-generation model, used to generate text given a prompt.
It is frequently used for chatbots, text completion, structured output, and more.

## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

First, clone this repository:
```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2-57b-a14b-moe-int4
```

With `11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2-57b-a14b-moe-int4` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-qwen-qwen2-57b-a14b-moe-int4-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```
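
Once the push completes, you can smoke-test the deployment against Baseten's standard predict endpoint. This is a minimal sketch: the `model-xxxxxx` placeholder comes from your `truss push` output, and the payload is borrowed from the `example_model_input` in the `config.yaml` below; adapt both as needed.

```python
import os
import requests

# Hypothetical smoke test against Baseten's standard predict route.
# Replace model-xxxxxx with your model ID from the `truss push` output.
resp = requests.post(
    "https://model-xxxxxx.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "messages": [
            {"role": "user", "content": "Tell me everything you know about optimized inference."}
        ],
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
print(resp.json())
```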

## Call your model

### OpenAI compatible inference
This solution is OpenAI compatible, which means you can use the OpenAI client library to interact with the model.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="not_required",  # the model name is ignored; routing is done by URL
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools
)

print(completion.choices[0].message.tool_calls)
```
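
Since the default `example_model_input` in the config below sets `stream: true`, here is a sketch of streaming with the same OpenAI client, reusing the `client` defined above:

```python
# Streaming chat completion: tokens arrive incrementally as they are generated.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Tell me everything you know about optimized inference."}],
    temperature=0.5,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```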


## Config.yaml
By default, the following configuration is used for this deployment. It uses `quantization_type: weights_int4`. Quantization is optional; remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16 (see the sketch after the config).

```yaml
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-qwen-qwen2-57b-a14b-moe-int4-truss-example
python_version: py39
requirements: []
resources:
  accelerator: A100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: Qwen/Qwen2-57B-A14B-Instruct
      revision: main
      source: HF
    max_seq_len: 32768
    num_builder_gpus: 4
    quantization_config:
      calib_max_seq_length: 4096
      calib_size: 3072
    quantization_type: weights_int4
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```
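
As noted above, quantization can be turned off entirely. A minimal sketch of the changed fields, assuming the rest of the config stays the same:

```yaml
# Sketch: running in float16/bfloat16 instead of int4 weight-only quantization.
trt_llm:
  build:
    # ...same fields as above, minus quantization_config...
    quantization_type: no_quant  # or simply remove this field
```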

## Support
If you have any questions or need assistance, please open an issue in this repository or contact our support team.