Commit e97c50a

dsikka and kylesayrs authored
[MTP] Add MTP Layers to final checkpoint (#2486)
SUMMARY:
- Update examples to save mtp layers; requires: vllm-project/compressed-tensors#640
- Fix how the MoE example handles the processor
- Update repo readme with Qwen 3.5 details

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
1 parent ed2d7f6 commit e97c50a

7 files changed: 60 additions & 39 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Qwen3.5 Support**: Qwen 3.5 can now be quantized using LLM Compressor. You will need to update your local transformers version using `uv pip install --upgrade transformers` and install LLM Compressor from source if using `<0.11`. Once updated, you should be able to run examples for the [MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) and [non-MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) variants of Qwen 3.5 end-to-end. For models quantized and published by the RedHat team, consider using the [NVFP4](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-NVFP4) and FP8 checkpoints for [Qwen3.5-122B](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic) and [Qwen3.5-397B](https://huggingface.co/RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic).
 * **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs/guides/big_models_and_distributed/model_loading.md).
 * **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples/quantization_w4a16/llama3_ddp_example.py).
 * **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models.

examples/quantization_w4a16_fp4/mxfp4/qwen3.5_example.py

Lines changed: 10 additions & 3 deletions
@@ -1,4 +1,5 @@
 from compressed_tensors.offload import dispatch_model
+from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
 from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
 
 from llmcompressor import oneshot
@@ -14,16 +15,18 @@
 # Configure the quantization algorithm and scheme.
 # In this case, we:
 # * quantize the weights to fp4 with per group 32 via ptq
-# * skip the visual encoder, lm_head, linear attention (Gated DeltaNet
-#   fused projections are incompatible with microscale formats), and MTP modules
+# * skip the visual encoder, lm_head, and linear attention
+#   (Gated DeltaNet fused projections are incompatible with microscale formats)
+
+# No need to include mtp layers as they are not loaded
+# through Qwen3_5ForConditionalGeneration
 recipe = QuantizationModifier(
     targets="Linear",
     scheme="MXFP4A16",
     ignore=[
         "lm_head",
         "re:.*visual.*",
         "re:.*linear_attn.*",
-        "re:.*mtp.*",
     ],
 )
 
@@ -45,3 +48,7 @@
 SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-MXFP4A16"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 processor.save_pretrained(SAVE_DIR)
+
+# MTP layers are excluded from the model through Qwen3_5ForConditionalGeneration
+# Save them as-is from the original checkpoint into the quantized output.
+save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
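A side note on the save path used above: `SAVE_DIR` is derived from the model ID with plain string operations. A minimal standalone sketch of that derivation (the model ID below is a placeholder, not one used in the examples):

```python
# Derive the local output directory from a Hugging Face model ID,
# mirroring the SAVE_DIR expression in the example above.
# "org/Qwen3.5-test" is a made-up ID used only for illustration.
MODEL_ID = "org/Qwen3.5-test"

# rstrip("/") tolerates a trailing slash; split("/")[-1] keeps only the
# repo name, dropping the org prefix.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-MXFP4A16"
print(SAVE_DIR)  # → Qwen3.5-test-MXFP4A16
```

The same pattern appears in every example, with only the scheme suffix changing.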

examples/quantization_w4a16_fp4/nvfp4/qwen3.5_example.py

Lines changed: 10 additions & 3 deletions
@@ -1,4 +1,5 @@
 from compressed_tensors.offload import dispatch_model
+from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
 from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
 
 from llmcompressor import oneshot
@@ -14,16 +15,18 @@
 # Configure the quantization algorithm and scheme.
 # In this case, we:
 # * quantize the weights to fp4 with per group 16 via ptq
-# * skip the visual encoder, lm_head, linear attention (Gated DeltaNet
-#   fused projections are incompatible with NVFP4), and MTP modules
+# * skip the visual encoder, lm_head, linear attention
+#   (Gated DeltaNet fused projections are incompatible with microscale formats)
+
+# No need to include mtp layers as they are not loaded
+# through Qwen3_5ForConditionalGeneration
 recipe = QuantizationModifier(
     targets="Linear",
     scheme="NVFP4A16",
     ignore=[
         "lm_head",
         "re:.*visual.*",
         "re:.*linear_attn.*",
-        "re:.*mtp.*",
     ],
 )
 
@@ -45,3 +48,7 @@
 SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 processor.save_pretrained(SAVE_DIR)
+
+# MTP layers are excluded from the model through Qwen3_5ForConditionalGeneration
+# Save them as-is from the original checkpoint into the quantized output.
+save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
Lines changed: 33 additions & 30 deletions
@@ -1,5 +1,7 @@
+import torch
+from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
 from datasets import load_dataset
-from transformers import AutoTokenizer, Qwen3_5MoeForConditionalGeneration
+from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
@@ -10,9 +12,10 @@
 
 # Load model.
 model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
-processor = AutoTokenizer.from_pretrained(MODEL_ID)
-
+processor = AutoProcessor.from_pretrained(MODEL_ID)
 
+# No need to include mtp layers as they are not loaded
+# through Qwen3_5MoeForConditionalGeneration
 recipe = QuantizationModifier(
     targets="Linear",
     scheme="NVFP4",
@@ -30,44 +33,39 @@
 NUM_CALIBRATION_SAMPLES = 256
 MAX_SEQUENCE_LENGTH = 4096
 
-# Load datasets and preprocess.
-samples_per_dataset = NUM_CALIBRATION_SAMPLES
-
-ds_ultrachat = load_dataset(
+ds = load_dataset(
     "HuggingFaceH4/ultrachat_200k",
-    split=f"train_sft[:{samples_per_dataset}]",
+    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
 )
-
-# Both datasets share a "messages" column with the same chat format.
-# Keep only that column so we can concatenate them.
-ds = ds_ultrachat.select_columns(["messages"])
+ds = ds.select_columns(["messages"])
 ds = ds.shuffle(seed=42)
 
 
-def preprocess(example):
-    return {
-        "text": processor.apply_chat_template(
-            example["messages"],
-            tokenize=False,
-        )
-    }
-
-
-ds = ds.map(preprocess)
-
-
-# Tokenize inputs.
-def tokenize(sample):
-    return processor(
-        sample["text"],
+def preprocess_function(example):
+    messages = [
+        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
+        for m in example["messages"]
+    ]
+    return processor.apply_chat_template(
+        messages,
+        return_tensors="pt",
         padding=False,
-        max_length=MAX_SEQUENCE_LENGTH,
         truncation=True,
+        max_length=MAX_SEQUENCE_LENGTH,
+        tokenize=True,
         add_special_tokens=False,
+        return_dict=True,
+        add_generation_prompt=False,
     )
 
 
-ds = ds.map(tokenize, remove_columns=ds.column_names)
+ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
+
+
+def data_collator(batch):
+    assert len(batch) == 1
+    return {key: torch.tensor(value) for key, value in batch[0].items()}
+
 
 # Apply quantization.
 oneshot(
@@ -77,9 +75,14 @@ def tokenize(sample):
     max_seq_length=MAX_SEQUENCE_LENGTH,
     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
     moe_calibrate_all_experts=True,
+    data_collator=data_collator,
 )
 
 # Save to disk in compressed-tensors format.
 SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
 model.save_pretrained(SAVE_DIR)
 processor.save_pretrained(SAVE_DIR)
+
+# MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration
+# Save them as-is from the original checkpoint into the quantized output.
+save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
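The `preprocess_function` in the diff above first rewraps each plain-string chat message into the structured content layout that multimodal chat templates expect. That rewrapping is plain Python and can be sketched in isolation (the sample conversation below is made up for illustration):

```python
# One ultrachat-style sample: "content" is a plain string per message.
example = {"messages": [
    {"role": "user", "content": "What does NVFP4 quantization do?"},
    {"role": "assistant", "content": "It compresses weights to 4-bit floats."},
]}

# Rewrap each string into the list-of-parts layout used by multimodal
# processors: content becomes [{"type": "text", "text": ...}].
messages = [
    {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
    for m in example["messages"]
]

print(messages[0]["content"][0]["type"])  # → text
```

Tokenization itself still happens inside `apply_chat_template`; only the message shape is adjusted here.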

src/llmcompressor/entrypoints/model_free/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -7,11 +7,13 @@
 from compressed_tensors.entrypoints.convert import (
     Converter,
     exec_jobs,
+)
+from compressed_tensors.quantization import QuantizationScheme
+from compressed_tensors.utils.safetensors_load import (
     get_checkpoint_files,
     is_weights_file,
     update_safetensors_index,
 )
-from compressed_tensors.quantization import QuantizationScheme
 from loguru import logger
 
 from llmcompressor.entrypoints.model_free.helpers import gpu_if_available

src/llmcompressor/entrypoints/model_free/reindex_fused_weights.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 
 import torch
 import tqdm
-from compressed_tensors.entrypoints.convert import (
+from compressed_tensors.utils.safetensors_load import (
     get_checkpoint_files,
     is_weights_file,
     update_safetensors_index,

src/llmcompressor/entrypoints/model_free/save_utils.py

Lines changed: 2 additions & 1 deletion
@@ -9,12 +9,13 @@
     TRANSFORM_CONFIG_NAME,
 )
 from compressed_tensors.config import CompressionFormat
-from compressed_tensors.entrypoints.convert import Converter, find_config_path
+from compressed_tensors.entrypoints.convert import Converter
 from compressed_tensors.quantization import (
     QuantizationConfig,
     QuantizationScheme,
     QuantizationStatus,
 )
+from compressed_tensors.utils.safetensors_load import find_config_path
 from loguru import logger
 from pydantic import ValidationError
 