## Qwen3.5 NVFP4 MoE Example

This example quantizes the Qwen3.5-122B-A10B sparse MoE model to NVFP4 (weights and activations quantized to FP4) using calibration data.

NOTE: This example requires `transformers >= v5`.

### Code Walkthrough

Let's walk through the main steps of the quantization process:
1. Load the model
2. Load and preprocess the calibration dataset
3. Configure the quantization algorithm and scheme
4. Apply quantization
5. Save to disk in compressed-tensors format

### 1. Load Model

```python
import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Load the model in its original precision.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
```

### 2. Load and Preprocess Calibration Dataset

```python
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.select_columns(["messages"])
ds = ds.shuffle(seed=42)


def preprocess_function(example):
    # Wrap each message's text so it matches the processor's expected chat format.
    messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
        for m in example["messages"]
    ]
    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    # Calibration runs with batch size 1; convert the sample's lists to tensors.
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```
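Since calibration runs one sample at a time, the collator only needs to turn a single preprocessed sample's lists into tensors. A self-contained sketch of that behavior, using a hypothetical token sequence in place of real processor output:

```python
import torch

# Hypothetical preprocessed sample, mimicking the dict returned by
# processor.apply_chat_template(..., return_dict=True) for one conversation.
sample = {"input_ids": [101, 7592, 2088, 102], "attention_mask": [1, 1, 1, 1]}


def data_collator(batch):
    # Batch size is always 1 during calibration, so no padding logic is needed.
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


batch = data_collator([sample])
print(batch["input_ids"].shape)  # torch.Size([4])
```

The collator deliberately avoids padding: with one sample per batch, each sequence keeps its natural length.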

### 3. Configure Quantization Algorithm and Scheme

In this case, we are doing the following:
- Quantize the weights and activations to FP4 via calibration-based PTQ
- Skip the `lm_head`, visual layers, MoE gate projections, embedding layers, shared expert gates, and linear attention layers
- MTP layers are not loaded through `Qwen3_5MoeForConditionalGeneration`, so there is no need to include them in the ignore list

```python
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)
```
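Entries with the `re:` prefix are treated as regular expressions over module names. Assuming the patterns are applied with `re.match` (anchored at the start of the name), a quick check against some hypothetical module names shows what gets skipped:

```python
import re

ignore = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*embed_tokens$",
    "re:.*shared_expert_gate$",
    "re:.*linear_attn.*",
]


def is_ignored(name: str) -> bool:
    # Strip the "re:" prefix and match each pattern from the start of the name.
    return any(re.match(p.removeprefix("re:"), name) for p in ignore)


# Hypothetical module names for illustration.
print(is_ignored("lm_head"))                                # True
print(is_ignored("model.layers.0.mlp.gate"))                # True  (router gate)
print(is_ignored("model.layers.0.mlp.experts.3.down_proj")) # False (expert weights are quantized)
```

Note the `$` anchor on `.*mlp.gate$`: it skips only the router gate itself, not expert projections whose names merely contain `gate`.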

### 4. Apply Quantization

Setting `moe_calibrate_all_experts=True` ensures every MoE expert receives calibration data, which improves quantization quality for sparse MoE models.

```python
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    data_collator=data_collator,
)
```

### 5. Save to Disk in Compressed-Tensors Format

```python
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are not loaded by Qwen3_5MoeForConditionalGeneration, so they were
# not quantized. Copy them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
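The output directory name is derived from the model ID by keeping only the repo name and appending the scheme suffix; a quick check of that string logic:

```python
MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Drop any trailing slash, keep the part after the org prefix, append the suffix.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
print(SAVE_DIR)  # Qwen3.5-122B-A10B-NVFP4
```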