Commit 6a02f09 (1 parent cf3bd64): Add qwen3.5 docs

3 files changed: +219 −0 lines changed

docs/key-models/qwen3.5/index.md

# Qwen3.5

Quantization examples for the Qwen3.5 family of models, including dense vision-language and sparse MoE variants.

> **Note:** These examples require `transformers >= v5`, which can be installed with:
> ```bash
> uv pip install --upgrade transformers
> ```
> With this installed, the examples run end-to-end on `main`. You may also need to update the version of `transformers` in your vLLM environment so that the tokenizer is applied properly.

- [NVFP4A16 Vision-Language Example](nvfp4-vl-example.md)
- [NVFP4 MoE Example](nvfp4-moe-example.md)
docs/key-models/qwen3.5/nvfp4-moe-example.md
## Qwen3.5 NVFP4 MoE Example

This example quantizes the Qwen3.5-122B-A10B sparse MoE model to NVFP4 (weights and activations quantized to FP4) using calibration data.

NOTE: This example requires `transformers >= v5`.

### Code Walkthrough

Let's walk through the main steps of the quantization process:

1. Load model
2. Load and preprocess calibration dataset
3. Configure quantization algorithm and scheme
4. Apply quantization
5. Save to disk in compressed-tensors format

### 1. Load Model

```python
import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Load model.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
```

### 2. Load and Preprocess Calibration Dataset

```python
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.select_columns(["messages"])
ds = ds.shuffle(seed=42)


def preprocess_function(example):
    messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
        for m in example["messages"]
    ]
    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```
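
`preprocess_function` rewraps each plain-text chat message into the list-of-parts content shape expected by multimodal chat templates. A standalone sketch of that reshaping on a toy sample (no tokenizer involved; the sample text is made up for illustration):

```python
# Toy sample mirroring the structure of an ultrachat_200k row.
example = {"messages": [{"role": "user", "content": "What is PTQ?"}]}

# The same list comprehension used inside preprocess_function:
messages = [
    {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
    for m in example["messages"]
]

print(messages[0]["content"])
# [{'type': 'text', 'text': 'What is PTQ?'}]
```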

### 3. Configure Quantization Algorithm and Scheme

In this case, we are doing the following:

- Quantize the weights and activations to FP4 via calibration-based PTQ
- Skip `lm_head`, visual layers, MoE gate projections, embedding layers, shared expert gates, and linear attention layers
- MTP layers are not loaded through `Qwen3_5MoeForConditionalGeneration`, so there is no need to include them in the ignore list

```python
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)
```
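
The `re:` prefix marks an ignore entry as a regular expression matched against module names (an assumption here, based on the compressed-tensors target-matching convention). A quick sketch of which modules the list above would skip, using hypothetical module names in a Qwen3.5 MoE layout:

```python
import re

ignore = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*embed_tokens$",
    "re:.*shared_expert_gate$",
    "re:.*linear_attn.*",
]

def is_ignored(name: str) -> bool:
    # Entries prefixed with "re:" are treated as regexes anchored at the start.
    return any(re.match(entry.removeprefix("re:"), name) for entry in ignore)

print(is_ignored("model.layers.0.mlp.gate"))                 # True: MoE router gate
print(is_ignored("model.layers.0.mlp.experts.3.down_proj"))  # False: expert weights are quantized
print(is_ignored("model.layers.0.linear_attn.in_proj_qkvz")) # True: linear attention
```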

### 4. Apply Quantization

`moe_calibrate_all_experts=True` ensures all MoE experts receive calibration data, which improves quantization quality for sparse MoE models.

```python
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    data_collator=data_collator,
)
```
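
To see why this flag matters, here is a small illustrative simulation (not the library's routing logic): with top-k routing over many experts, a modest calibration set can leave some experts with no tokens at all unless calibration is forced through every expert.

```python
import random

random.seed(0)
num_experts, top_k = 128, 8

# Pretend each calibration token is routed to top_k random experts,
# and count how many experts receive at least one token.
def covered_experts(num_tokens: int) -> int:
    hit = set()
    for _ in range(num_tokens):
        hit.update(random.sample(range(num_experts), top_k))
    return len(hit)

print(covered_experts(8))    # 8 tokens can reach at most 64 of the 128 experts
print(covered_experts(512))  # a larger set covers far more experts
```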

### 5. Save to Disk in Compressed-Tensors Format

```python
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```

docs/key-models/qwen3.5/nvfp4-vl-example.md

## Qwen3.5 NVFP4A16 Vision-Language Example

This example quantizes the Qwen3.5-27B vision-language model to NVFP4A16 (weights quantized to FP4 with per-group-16 granularity, activations in FP16) using data-free PTQ.

### Code Walkthrough

Let's walk through the main steps of the quantization process:

1. Load model
2. Configure quantization algorithm and scheme
3. Apply quantization
4. Run sample generation
5. Save to disk in compressed-tensors format

### 1. Load Model

```python
from compressed_tensors.offload import dispatch_model
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model.
MODEL_ID = "Qwen/Qwen3.5-27B"
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_ID, dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```

### 2. Configure Quantization Algorithm and Scheme

In this case, we are doing the following:

- Quantize the weights to FP4 with per-group-16 granularity via data-free PTQ
- Skip the visual encoder, `lm_head`, and linear attention layers (Gated DeltaNet fused projections are incompatible with NVFP4)
- MTP layers are not loaded through `Qwen3_5ForConditionalGeneration`, so there is no need to include them in the ignore list

```python
# No need to include mtp layers as they are not loaded
# through Qwen3_5ForConditionalGeneration.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*linear_attn.*",
    ],
)
```
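
Note that `lm_head` here is a bare module name while the other two entries carry the `re:` regex prefix. A sketch of how both forms would be matched, assuming bare entries are exact-name matches and `re:` entries are regexes (the module names below are hypothetical):

```python
import re

ignore = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]

def is_ignored(name: str) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: match the pattern against the module name.
            if re.match(entry.removeprefix("re:"), name):
                return True
        elif name == entry:
            # Bare entry: exact module name.
            return True
    return False

print(is_ignored("lm_head"))                        # True: exact match
print(is_ignored("model.visual.blocks.0.attn.qkv")) # True: regex match
print(is_ignored("model.layers.0.mlp.gate_proj"))   # False: quantized
```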

### 3. Apply Quantization

Since this scheme is data-free, no calibration dataset is passed:

```python
oneshot(model=model, recipe=recipe)
```

### 4. Run Sample Generation

```python
print("\n\n========== SAMPLE GENERATION ==============")
dispatch_model(model)
messages = [{"role": "user", "content": "Hello my name is"}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")
```

### 5. Save to Disk in Compressed-Tensors Format

```python
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5ForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
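
The save directory name is derived from the model ID by keeping only the repository name and appending the scheme suffix; a quick standalone check of that string manipulation:

```python
MODEL_ID = "Qwen/Qwen3.5-27B"
# Strip any trailing slash, keep the part after the org name, append the scheme.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"
print(SAVE_DIR)  # Qwen3.5-27B-NVFP4A16
```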
