Commit efe19de
add qwen moe
1 parent 87551df commit efe19de

File tree

5 files changed: +242 -20 lines changed
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@

# TensorRT-LLM Briton with Qwen/Qwen2-57B-A14B-MoE-int4

This is a deployment for TensorRT-LLM Briton with Qwen/Qwen2-57B-A14B-MoE-int4. Briton is Baseten's solution for production-grade deployments of causal language models (e.g. Llama, Qwen, Mistral) via TensorRT-LLM.

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and inflight batching
- *Distributed inference*: run large models (such as Llama-405B) tensor-parallel
- *JSON-schema based structured output* for any model
- *Chunked prefill* for long generation tasks

Optionally, you can also enable:
- *Speculative decoding* using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200, and L4 GPUs

# Examples

This deployment is specifically designed for the Hugging Face model [Qwen/Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in the model's architecture name; currently we support e.g. Llama, Qwen, and Mistral models. A quick way to check is sketched below.
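
For instance, a minimal sketch (assuming a recent `transformers` version that includes the `qwen2_moe` architecture) that inspects a checkpoint's `architectures` field to confirm it is a `ForCausalLM` model:

```python
from transformers import AutoConfig

# Load only the model config (no weights) and inspect the architecture name.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2-57B-A14B-Instruct")
print(cfg.architectures)  # expected to end in "ForCausalLM", e.g. ["Qwen2MoeForCausalLM"]
```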

Qwen/Qwen2-57B-A14B-Instruct is a text-generation model, used to generate text given a prompt.
It is frequently used in chatbots, text completion, structured output, and more.

## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2-57b-a14b-moe-int4
```

With `11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2-57b-a14b-moe-int4` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-qwen-qwen2-57b-a14b-moe-int4-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```

## Call your model

### OpenAI compatible inference

This solution is OpenAI compatible, which means you can use the OpenAI client library to interact with the model.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool-calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools,
)

print(completion.choices[0].message.tool_calls)
```
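
Streaming is supported through the same OpenAI-compatible endpoint. A minimal sketch, reusing the `client` from above (same placeholder `model-xxxxxx` base URL):

```python
# Stream a chat completion token by token.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Tell me everything about Baseten.co!"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; the final chunk may have no content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```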

## Config.yaml

By default, the following configuration is used for this deployment. This config uses `quantization_type=weights_int4`. Quantization is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16.

```yaml
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-qwen-qwen2-57b-a14b-moe-int4-truss-example
python_version: py39
requirements: []
resources:
  accelerator: A100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: Qwen/Qwen2-57B-A14B-Instruct
      revision: main
      source: HF
    max_seq_len: 32768
    num_builder_gpus: 4
    quantization_config:
      calib_max_seq_length: 4096
      calib_size: 3072
    quantization_type: weights_int4
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```
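
For reference, a sketch of the unquantized variant based on the note above (an assumption, not a tested config: only the `trt_llm.build` section changes, and the calibration settings are dropped since weight-only calibration is no longer needed):

```yaml
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: Qwen/Qwen2-57B-A14B-Instruct
      revision: main
      source: HF
    max_seq_len: 32768
    quantization_type: no_quant  # float16/bfloat16 weights
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```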

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-qwen-qwen2-57b-a14b-moe-int4-truss-example
python_version: py39
requirements: []
resources:
  accelerator: A100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: Qwen/Qwen2-57B-A14B-Instruct
      revision: main
      source: HF
    max_seq_len: 32768
    num_builder_gpus: 4
    quantization_config:
      calib_max_seq_length: 4096
      calib_size: 3072
    quantization_type: weights_int4
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true

11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwq-32b-reasoning-fp8/config.yaml

Lines changed: 2 additions & 2 deletions
@@ -3,15 +3,15 @@ environment_variables: {}
 external_package_dirs: []
 model_metadata:
   example_model_input:
-    max_tokens: 2048
+    max_tokens: 512
     messages:
     - content: Tell me everything you know about optimized inference.
       role: user
     stream: true
     temperature: 0.5
   tags:
   - openai-compatible
-model_name: Qwen QwQ 32B
+model_name: Briton-qwen-qwq-32b-reasoning-fp8-truss-example
 python_version: py39
 requirements: []
 resources:

11-embeddings-reranker-classification-tensorrt/README.md

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ Optionally, you can also enable:
 Examples:
 - [Qwen/QwQ-32B-reasoning-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwq-32b-reasoning-fp8)
 - [Qwen/QwQ-32B-reasoning-with-speculative-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwq-32b-reasoning-with-speculative-fp8)
+- [Qwen/Qwen2-57B-A14B-MoE-int4-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2-57b-a14b-moe-int4)
 - [Qwen/Qwen2.5-72B-Instruct-tp2-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2.5-72b-instruct-tp2-fp8)
 - [Qwen/Qwen2.5-7B-Instruct-with-speculative-lookahead-decoding-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-qwen-qwen2.5-7b-instruct-with-speculative-lookahead-decoding-fp8)
 - [deepseek-ai/DeepSeek-R1-Distill-Llama-70B-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-deepseek-ai-deepseek-r1-distill-llama-70b-fp8)

11-embeddings-reranker-classification-tensorrt/templating/generate_templates.py

Lines changed: 22 additions & 18 deletions
@@ -209,9 +209,8 @@ def make_truss_config(self, dp):
         self.trt_config.build.max_seq_len = max_position_embeddings
         assert max_position_embeddings >= 512, "Model needs to have at least 512 tokens"
         if (
-            hf_cfg.model_type == "qwen2"
-            and self.trt_config.build.quantization_type
-            in [TrussTRTLLMQuantizationType.FP8_KV, TrussTRTLLMQuantizationType.FP8]
+            hf_cfg.model_type in ["qwen2", "qwen2_moe"]
+            and self.trt_config.build.quantization_type is not None
         ):
             if (
                 self.trt_config.build.quantization_type
@@ -220,19 +219,13 @@ def make_truss_config(self, dp):
                 raise ValueError(
                     f"Qwen2 models do not support FP8_KV quantization / have quality issues with this dtype - please use regular FP8 for now in the model library {dp.hf_model_id}"
                 )
-            elif (
-                self.trt_config.build.quantization_type
-                == TrussTRTLLMQuantizationType.FP8
-            ):
-                # increase the quantization example size for qwen2 models
-                self.trt_config.build.quantization_config = (
-                    TrussTRTQuantizationConfiguration(
-                        calib_size=3072,
-                        calib_max_seq_length=min(
-                            4096, self.trt_config.build.max_seq_len
-                        ),
-                    )
+            # increase the quantization example size for qwen2 models
+            self.trt_config.build.quantization_config = (
+                TrussTRTQuantizationConfiguration(
+                    calib_size=3072,
+                    calib_max_seq_length=min(4096, self.trt_config.build.max_seq_len),
                 )
+            )
 
         secrets = {}
         if dp.is_gated:
@@ -897,7 +890,7 @@ def llamalike_config(
     # config for meta-llama/Llama-3.3-70B-Instruct (FP8)
     build_kwargs = dict()
     runtime_kwargs = dict()
-    if quant != TrussTRTLLMQuantizationType.NO_QUANT:
+    if quant != TrussTRTLLMQuantizationType.NO_QUANT and tp in [1, 2]:
         if tp == 1:
             build_kwargs["num_builder_gpus"] = 4
         if quant == TrussTRTLLMQuantizationType.FP8_KV:
@@ -927,8 +920,6 @@ def llamalike_config(
         )
 
     if quant in [
-        TrussTRTLLMQuantizationType.WEIGHTS_ONLY_INT4,
-        TrussTRTLLMQuantizationType.WEIGHTS_ONLY_INT8,
         TrussTRTLLMQuantizationType.WEIGHTS_INT4_KV_INT8,
     ]:
         config.build.plugin_configuration.use_paged_context_fmha = False
@@ -1103,6 +1094,19 @@ def llamalike_spec_dec(
             )
         ),
     ),
+    Deployment(
+        "Qwen/Qwen2-57B-A14B-MoE-int4",
+        "Qwen/Qwen2-57B-A14B-Instruct",
+        Accelerator.A100,
+        TextGen(),
+        solution=Briton(
+            trt_config=llamalike_config(
+                repoid="Qwen/Qwen2-57B-A14B-Instruct",
+                tp=1,
+                quant=TrussTRTLLMQuantizationType.WEIGHTS_ONLY_INT4,
+            )
+        ),
+    ),
     # mistralai/Mistral-Small-24B-Instruct-2501
     Deployment(
         "mistralai/Mistral-Small-24B-Instruct-2501",
