Commit 36fab73

[Docs] Add Qwen3.5 to Key Models (#2502)

SUMMARY:
- Example walkthrough
- Notes about the `transformers` version
- Should land https://github.com/vllm-project/llm-compressor/pulls/dsikka to make these examples make sense

1 parent e97c50a commit 36fab73

File tree

5 files changed: +232 −1 lines changed

docs/.nav.yml

Lines changed: 4 additions & 0 deletions
```diff
@@ -19,6 +19,10 @@ nav:
     - Qwen3:
       - key-models/qwen3/index.md
       - FP8 Example: key-models/qwen3/fp8-example.md
+    - Qwen3.5:
+      - key-models/qwen3.5/index.md
+      - NVFP4A16 VL Example: key-models/qwen3.5/nvfp4-vl-example.md
+      - NVFP4 MoE Example: key-models/qwen3.5/nvfp4-moe-example.md
     - Kimi-K2:
       - key-models/kimi-k2/index.md
       - FP8 Example: key-models/kimi-k2/fp8-example.md
```

docs/key-models/index.md

Lines changed: 9 additions & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Key Models

-The following models are among the most commonly used with LLM Compressor: Llama 4, Qwen3, Kimi-K2, and Mistral Large 3. Each model page contains quantization examples with tested configurations and recommended parameters.
+The following models are among the most commonly used with LLM Compressor: Llama 4, Qwen3, Qwen3.5, Kimi-K2, and Mistral Large 3. Each model page contains quantization examples with tested configurations and recommended parameters.

 <div class="grid cards" markdown>

@@ -20,6 +20,14 @@ The following models are among the most commonly used with LLM Compressor: Llama

     [:octicons-arrow-right-24: Qwen3](qwen3/index.md)

+- **Qwen3.5**
+
+    ---
+
+    Qwen3.5 dense vision-language and sparse MoE models.
+
+    [:octicons-arrow-right-24: Qwen3.5](qwen3.5/index.md)
+
 - **Kimi-K2**

     ---
```

docs/key-models/qwen3.5/index.md

Lines changed: 12 additions & 0 deletions
# Qwen3.5

Quantization examples for the Qwen3.5 family of models, including dense vision-language and sparse MoE variants.

> **Note:** These examples require `transformers >= v5`, which can be installed with:
>
> ```bash
> uv pip install --upgrade transformers
> ```
>
> With this, the examples run end-to-end on `main`. You may also need to upgrade `transformers` in your vLLM environment so that the tokenizer is applied correctly.

- [NVFP4A16 Vision-Language Example](nvfp4-vl-example.md)
- [NVFP4 MoE Example](nvfp4-moe-example.md)
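Before running either example, it can help to confirm that the installed `transformers` version meets the `>= v5` requirement. A minimal sketch of such a check; the `parse_version` and `meets_minimum` helpers are illustrative only, not part of LLM Compressor:

```python
def parse_version(v: str) -> tuple:
    """Split a version string like '4.57.1' into a comparable tuple of ints."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """True if the installed version satisfies the minimum requirement."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("4.57.1"))  # False: too old for these examples
print(meets_minimum("5.0.0"))   # True
```

In a real environment you would pass `transformers.__version__` as the `installed` argument.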
docs/key-models/qwen3.5/nvfp4-moe-example.md

Lines changed: 123 additions & 0 deletions
## Qwen3.5 NVFP4 MoE Example

This example quantizes the Qwen3.5-122B-A10B sparse MoE model to NVFP4 (weights and activations quantized to FP4) using calibration data.

NOTE: This example requires `transformers >= v5`.

### Code Walkthrough

Let's walk through the main steps of the quantization process:

1. Load model
2. Load and preprocess calibration dataset
3. Configure quantization algorithm and scheme
4. Apply quantization
5. Save to disk in compressed-tensors format
### 1. Load Model

```python
import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Load model.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
```
### 2. Load and Preprocess Calibration Dataset

```python
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.select_columns(["messages"])
ds = ds.shuffle(seed=42)


def preprocess_function(example):
    messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
        for m in example["messages"]
    ]
    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```
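The `preprocess_function` above rewraps each plain-text chat message into the list-of-content-parts form that multimodal processors expect. A standalone sketch of just that transformation, with a made-up sample conversation:

```python
example = {
    "messages": [
        {"role": "user", "content": "What is PTQ?"},
        {"role": "assistant", "content": "Post-training quantization."},
    ]
}

# Wrap each string content in a [{"type": "text", ...}] part list,
# matching the structure built inside preprocess_function.
messages = [
    {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
    for m in example["messages"]
]

print(messages[0])
# {'role': 'user', 'content': [{'type': 'text', 'text': 'What is PTQ?'}]}
```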
### 3. Configure Quantization Algorithm and Scheme

In this case, we are doing the following:

- Quantize the weights and activations to FP4 via calibration-based PTQ
- Skip `lm_head`, visual layers, MoE gate projections, embedding layers, shared expert gates, and linear attention layers
- MTP layers are not loaded through `Qwen3_5MoeForConditionalGeneration`, so there is no need to include them in the ignore list

```python
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)
```
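Entries prefixed with `re:` are treated as regular expressions matched against module names, which is how the ignore list skips the MoE router gate (`mlp.gate`) without skipping the expert projections (`gate_proj`). A rough illustration of that distinction; using `re.match` here is an assumption about the matching rule, whose exact semantics live in compressed-tensors:

```python
import re

# Patterns from the recipe above, with the "re:" prefix stripped.
IGNORE_PATTERNS = [
    r".*lm_head",
    r".*mlp.gate$",
    r".*embed_tokens$",
    r".*shared_expert_gate$",
    r".*linear_attn.*",
]


def is_ignored(module_name: str) -> bool:
    """Match each pattern anchored at the start of the module name."""
    return any(re.match(p, module_name) for p in IGNORE_PATTERNS)


print(is_ignored("model.layers.3.mlp.gate"))                 # True: router gate skipped
print(is_ignored("model.layers.3.mlp.experts.7.gate_proj"))  # False: expert weight quantized
```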
### 4. Apply Quantization

`moe_calibrate_all_experts=True` ensures all MoE experts receive calibration data, which improves quantization quality for sparse MoE models.

```python
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    data_collator=data_collator,
)
```
### 5. Save to Disk in Compressed-Tensors Format

```python
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
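The `SAVE_DIR` expression simply derives a local directory name from the Hugging Face model id, dropping the org prefix and appending the scheme as a suffix. In isolation:

```python
MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Keep only the repo name and tag it with the quantization scheme.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
print(SAVE_DIR)  # Qwen3.5-122B-A10B-NVFP4
```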
docs/key-models/qwen3.5/nvfp4-vl-example.md

Lines changed: 84 additions & 0 deletions
## Qwen3.5 NVFP4A16 Vision-Language Example

This example quantizes the Qwen3.5-27B vision-language model to NVFP4A16 (weights quantized to FP4 with per-group-16 granularity, activations kept in FP16) using data-free PTQ.

### Code Walkthrough

Let's walk through the main steps of the quantization process:

1. Load model
2. Configure quantization algorithm and scheme
3. Apply quantization
4. Run sample generation
5. Save to disk in compressed-tensors format
### 1. Load Model

```python
from compressed_tensors.offload import dispatch_model
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model.
MODEL_ID = "Qwen/Qwen3.5-27B"
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_ID, dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```
### 2. Configure Quantization Algorithm and Scheme

In this case, we are doing the following:

- Quantize the weights to FP4 with per-group-16 granularity via data-free PTQ
- Skip the visual encoder, `lm_head`, and linear attention layers (Gated DeltaNet fused projections are incompatible with NVFP4)
- MTP layers are not loaded through `Qwen3_5ForConditionalGeneration`, so there is no need to include them in the ignore list

```python
# No need to include MTP layers, as they are not loaded
# through Qwen3_5ForConditionalGeneration.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*linear_attn.*",
    ],
)
```
### 3. Apply Quantization

```python
oneshot(model=model, recipe=recipe)
```
### 4. Run Sample Generation

```python
print("\n\n========== SAMPLE GENERATION ==============")
dispatch_model(model)
messages = [{"role": "user", "content": "Hello my name is"}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")
```
### 5. Save to Disk in Compressed-Tensors Format

```python
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5ForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
