Commit ba42881

Update documentation

1 parent: b04f957

File tree

4 files changed (+13, -18 lines)


examples/multimodal_vision/llama4_example.py

Lines changed: 3 additions & 6 deletions

@@ -3,18 +3,15 @@
 from transformers import Llama4ForConditionalGeneration, Llama4Processor
 
 from llmcompressor import oneshot
-from llmcompressor.modeling import replace_modules_for_calibration
 from llmcompressor.modifiers.quantization import GPTQModifier
 
 # Select model and load it.
 model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
 model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
 processor = Llama4Processor.from_pretrained(model_id)
-# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
-# This change allows compatibility with vllm.
-# To apply your own custom module for experimentation, consider updating
-# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
-model = replace_modules_for_calibration(model)
+# MoE calibration is now handled automatically by the pipeline.
+# The `SequentialLlama4TextMoe` modules will be applied during calibration
+# to enable proper expert calibration and vLLM compatibility.
 
 DATASET_ID = "neuralmagic/calibration"
 NUM_CALIBRATION_SAMPLES = 512

examples/quantization_w4a4_fp4/README.md

Lines changed: 4 additions & 4 deletions

@@ -84,11 +84,11 @@ We have successfully created an `nvfp4` model!
 
 # Quantizing MoEs
 
-To quantize MoEs, a few additional steps are required. An example quantizing Llama4 can be found under `llama4_example.py`. Here, we replace all `Llama4TextMoe` modules by calling `replace_modules_for_calibration`. This replacement allows us to:
+When quantizing MoEs, calibration is handled automatically by the pipeline. An example quantizing Llama4 can be found under `llama4_example.py`. The pipeline applies the appropriate MoE calibration context, which:
 
-1. Linearize the model to enable quantization and execution in vLLM. This is required as the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
-2. Ensure experts are quantized correctly as not all experts are activated during calibration
+1. Linearizes the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor requires to run quantization.
+2. Ensures experts are quantized correctly, as not all experts are activated during calibration.
 
-Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model does not require additional linearization as required by the Llama4 model. However, similar to Llama4, in order to ensure the experts are quantized correctly, we can pass in `calibrate_moe_context` which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock` which updates how the forward pass is handled in the MoE block during calibration. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with this behavior and evaluate its impact on quantization performance.
+Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model uses contextual MoE calibration, which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock`, changing how the forward pass is handled in the MoE block during calibration. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to experiment with this behavior and evaluate its impact on quantization performance.
 
 
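The module-replacement idea the README describes (swap each sparse MoE block for a calibration-friendly equivalent before calibration runs) can be illustrated with a toy sketch. This is not the llmcompressor implementation; the classes `Module`, `ToyMoE`, and `CalibratableMoE` below are invented stand-ins for illustration only.

```python
# Toy sketch of a replace-modules-for-calibration pass.
# All class names here are hypothetical, not llmcompressor APIs.

class Module:
    """Minimal stand-in for a neural-network module with named children."""
    def __init__(self, **children):
        self.children = dict(children)

class ToyMoE(Module):
    """Pretend sparse MoE block: routes each input to a single expert."""
    pass

class CalibratableMoE(Module):
    """Calibration-friendly replacement: conceptually runs every expert on
    every input so that all experts see calibration data."""
    def __init__(self, original):
        super().__init__(**original.children)

def replace_for_calibration(module):
    """Recursively swap ToyMoE blocks for calibratable equivalents,
    mirroring what a module-replacement pass does before calibration."""
    for name, child in module.children.items():
        if isinstance(child, ToyMoE):
            module.children[name] = CalibratableMoE(child)
        else:
            replace_for_calibration(child)
    return module

model = Module(
    layer0=Module(moe=ToyMoE(), attn=Module()),
    layer1=Module(moe=ToyMoE(), attn=Module()),
)
replace_for_calibration(model)
swapped = [type(layer.children["moe"]).__name__ for layer in model.children.values()]
print(swapped)  # ['CalibratableMoE', 'CalibratableMoE']
```

The real pass additionally has to preserve weights and produce modules built from `torch.nn.Linear` layers so the result remains quantizable and loadable in vLLM; the sketch only shows the recursive swap itself.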

examples/quantization_w4a4_fp4/llama4_example.py

Lines changed: 3 additions & 6 deletions

@@ -3,18 +3,15 @@
 from transformers import Llama4ForConditionalGeneration, Llama4Processor
 
 from llmcompressor import oneshot
-from llmcompressor.modeling import replace_modules_for_calibration
 from llmcompressor.modifiers.quantization import QuantizationModifier
 
 # Select model and load it.
 model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
 model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
 processor = Llama4Processor.from_pretrained(model_id)
-# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
-# This change allows compatibility with vllm.
-# To apply your own custom module for experimentation, consider updating
-# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
-model = replace_modules_for_calibration(model)
+# MoE calibration is now handled automatically by the pipeline.
+# The `SequentialLlama4TextMoe` modules will be applied during calibration
+# to enable proper expert calibration and vLLM compatibility.
 
 DATASET_ID = "neuralmagic/calibration"
 NUM_CALIBRATION_SAMPLES = 20

examples/quantizing_moe/deepseek_r1_example.py

Lines changed: 3 additions & 2 deletions

@@ -2,7 +2,6 @@
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
 
 from llmcompressor import oneshot
-from llmcompressor.modeling import replace_modules_for_calibration
 from llmcompressor.modifiers.quantization import GPTQModifier
 
 # Select model and load it.
@@ -20,7 +19,9 @@
     model_id, torch_dtype="auto", config=config
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = replace_modules_for_calibration(model)
+# MoE calibration is now handled automatically by the pipeline.
+# The `DeepseekV3MoECalibrate` modules will be applied during calibration
+# to enable proper expert calibration.
 
 # Select calibration dataset.
 DATASET_ID = "HuggingFaceH4/ultrachat_200k"
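The "contextual" calibration mentioned for the Qwen3 example (temporarily updating the model definition only for the duration of calibration, then restoring it) can be sketched as a context manager. This is an illustration only, not llmcompressor's actual mechanism; `SparseMoE`, `calibration_forward`, and `calibrate_moe_context` are invented names.

```python
# Toy sketch of contextual MoE calibration: patch a block's forward for
# the duration of calibration, then restore the original definition.
# All names here are hypothetical, not llmcompressor APIs.
from contextlib import contextmanager

class SparseMoE:
    """Pretend MoE block: only the routed expert runs in normal inference."""
    def forward(self, x):
        return f"routed({x})"

def calibration_forward(self, x):
    """Dense forward used only during calibration, so that every expert
    processes the calibration samples and produces statistics."""
    return f"all_experts({x})"

@contextmanager
def calibrate_moe_context(block):
    original = type(block).forward
    type(block).forward = calibration_forward
    try:
        yield block
    finally:
        type(block).forward = original  # model definition is restored

block = SparseMoE()
with calibrate_moe_context(block) as b:
    during = b.forward("sample")
after = block.forward("sample")
print(during, after)  # all_experts(sample) routed(sample)
```

The `try`/`finally` guarantees the original forward is restored even if calibration raises, which is why a context manager is a natural fit for a temporary model-definition change.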
