[MoE] Add conditional expert calibration #1701

dichn · 2025-08-01T08:59:30Z

Change Purpose:

Improve MoE calibration support by adding configuration-based expert execution

Change Details:

Create class CalibrationConfig in a standalone module, llmcompressor/modeling/config.py
Add conditional expert execution based on:
- moe_calibrate_all_experts: If True, all experts run for every token; if False, only routed experts run
- moe_calibrate_gated_acts: If True, routed experts contribute to the final output; if False, expert activations are computed but excluded from the final output
Add unit test to verify all experts are triggered during MoE calibration
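
The two flags can be illustrated with a toy, dependency-free forward pass. This is only a sketch of one reading of the description above, not the code in this PR: `moe_forward`, its arguments, and the scalar "experts" are hypothetical, and real MoE layers operate on tensors rather than Python floats.

```python
from typing import Callable, List, Sequence

Expert = Callable[[float], float]


def moe_forward(
    tokens: Sequence[float],
    experts: Sequence[Expert],
    routed: Sequence[Sequence[int]],     # routed[t]: expert indices the router chose for token t
    weights: Sequence[Sequence[float]],  # weights[t][e]: router weight of expert e for token t
    calibrate_all_experts: bool,
    calibrate_gated_acts: bool,
    calls: List[int],                    # calls[e]: number of tokens expert e has processed
) -> List[float]:
    """Toy MoE forward pass (hypothetical helper, for illustration only)."""
    out = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        for t, x in enumerate(tokens):
            is_routed = e in routed[t]
            if not (calibrate_all_experts or is_routed):
                continue  # this expert never sees this token
            y = expert(x)  # the activation is computed, so an observer could record it
            calls[e] += 1
            # Only routed experts may contribute, and only when gated outputs are enabled.
            if is_routed and calibrate_gated_acts:
                out[t] += weights[t][e] * y
    return out


experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
tokens = [1.0, 2.0]
routed = [[0], [1]]  # token 0 -> expert 0, token 1 -> expert 1
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

calls = [0, 0, 0]
out = moe_forward(tokens, experts, routed, weights, True, True, calls)
print(out, calls)  # [2.0, 4.0] [2, 2, 2]
```

With `calibrate_all_experts=False` the third expert is never called in this toy, and with `calibrate_gated_acts=False` every activation is still computed but the output stays at its initial zeros.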

gemini-code-assist

Summary of Changes

Hello @dichn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of Mixture-of-Experts (MoE) model calibration by introducing configurable control over expert execution and output contribution. My changes allow for more precise calibration strategies, enabling scenarios where all experts are evaluated regardless of routing, or where expert activations are computed without affecting the final model output.

Highlights

New Configuration Class: I've introduced a new CalibrationConfig class in llmcompressor/modeling/config.py to centralize and manage parameters for Mixture-of-Experts (MoE) calibration. This class includes moe_calibrate_all_experts and moe_calibrate_gated_acts boolean flags, with validation to prevent unsupported configurations.
Conditional Expert Execution: I've implemented conditional expert execution logic within the MoE layers for DeepseekV3, Llama4, and Qwen3 models. This allows for flexible control during calibration: moe_calibrate_all_experts ensures all experts run their forward pass, while moe_calibrate_gated_acts determines if their outputs contribute to the final hidden state.
Integration with Model Preparation: The replace_modules_for_calibration and moe_calibration_context functions in llmcompressor/modeling/prepare.py have been updated to accept and pass the new calibration configuration. This ensures that the desired calibration behavior is applied when MoE modules are replaced or temporarily modified.
Unit Test Coverage: I've added new unit tests for DeepseekV3, Llama4, and Qwen3 MoE layers to verify that all experts are triggered when moe_calibrate_all_experts is enabled and moe_calibrate_gated_acts is disabled, ensuring the intended calibration behavior.
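
As a rough illustration of a configuration object with validation, here is a minimal sketch. It deliberately uses a stdlib dataclass instead of the pydantic model in the PR so that it runs without third-party packages. The class name `CalibrationConfigSketch` is hypothetical, and the rejected combination (both flags False) is an assumption taken from the reviewer's table later in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationConfigSketch:
    """Stdlib stand-in for the PR's CalibrationConfig (hypothetical name).

    moe_calibrate_all_experts: run every expert on every token during calibration.
    moe_calibrate_gated_acts:  let routed experts contribute to the final output.
    """

    moe_calibrate_all_experts: bool
    moe_calibrate_gated_acts: bool

    def __post_init__(self) -> None:
        # Assumption: running only routed experts while also discarding their
        # outputs is the unsupported combination.
        if not self.moe_calibrate_all_experts and not self.moe_calibrate_gated_acts:
            raise ValueError(
                "At least one of moe_calibrate_all_experts and "
                "moe_calibrate_gated_acts must be True"
            )


config = CalibrationConfigSketch(
    moe_calibrate_all_experts=True, moe_calibrate_gated_acts=False
)
```

Validating at construction time means an unsupported combination fails before any model modules are replaced.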


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

github-actions · 2025-08-01T09:00:45Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite. Please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist

Code Review

This pull request introduces conditional expert calibration for Mixture-of-Experts (MoE) models, controlled by a new CalibrationConfig. The changes allow for more flexible calibration setups. A critical issue was identified where the moe_calibrate_gated_acts flag was not correctly implemented in the model forward passes, leading to incorrect behavior when set to False. A suggestion was also made to improve the clarity of an error message.

src/llmcompressor/modeling/deepseek_v3.py

src/llmcompressor/modeling/llama4.py

src/llmcompressor/modeling/qwen3_moe.py

src/llmcompressor/modeling/config.py

dichn · 2025-08-01T09:02:00Z

@kylesayrs
The newly introduced tests/llmcompressor/modeling/test_calib_llama4.py is currently untested due to regional access restrictions to the llama4 repo (I have marked it with pytest.skip). Could you please help run the test on your end?

dichn · 2025-08-02T02:55:34Z

Re-pushed with the following changes:

Fix the unconditional final output addition
Improve the calibration configuration error message
Add a unit test for the calibration configuration False scenario

dsikka

Thank you for the PR! Do you mind listing how you’ve tested the updated examples?

tests/llmcompressor/modeling/test_calib_llama4.py


dichn · 2025-08-04T09:45:18Z

Re-pushed to add the missing Llama4ForConditionalGeneration in test_calib_llama4.py.

> Do you mind listing how you’ve tested the updated examples?

For my test plan, I’ve verified the changes using the newly added unit tests. However, due to limited GPU capacity on my local development machine (my laptop), I haven't validated the patch against a full model.

Note on the skipped test test_calib_llama4.py:

This test is currently marked with @pytest.mark.skip because I haven't been able to verify it due to regional access restrictions to LLaMA 4 resources. I’ve asked Kyle to help run the test, and once it passes on his end, the skip mark can be safely removed.

CC: @dsikka @kylesayrs

rahul-tuli

Thank you for the PR, it looks great! I had a couple suggestions:

Documenting the flags in the class itself, so it's clearer to future users
A few naming suggestions (nits)

I'm running the llama4 example at the moment, will update here if it passes!

rahul-tuli · 2025-08-06T13:07:49Z

src/llmcompressor/modeling/config.py

+
+
+class CalibrationConfig(BaseModel):
+    moe_calibrate_all_experts: bool


Could we add more information in this config class around what these flags do for future readers, so it's clear which flag should be set for which mode?

I was thinking something like:

| all_experts | gated_acts | Behavior |
|-------------|------------|----------|
| True | True | All experts run, routed experts contribute to output (current default) |
| True | False | All experts run for calibration, but outputs ignored |
| False | True | Only routed experts run and contribute (standard inference) |
| False | False | Invalid configuration (raises error) |

rahul-tuli · 2025-08-06T13:15:00Z

src/llmcompressor/modeling/config.py

+from pydantic import BaseModel, model_validator
+
+
+class CalibrationConfig(BaseModel):


nit: What do you think about renaming to MoECalibrationConfig?

rahul-tuli · 2025-08-06T13:25:45Z

src/llmcompressor/modeling/config.py

+
+class CalibrationConfig(BaseModel):
+    moe_calibrate_all_experts: bool
+    moe_calibrate_gated_acts: bool


nit: Consider renaming to something like use_gated_outputs, since the current name suggests it's about "calibrating gated activations", but it actually controls whether expert outputs contribute to the final result.

rahul-tuli · 2025-08-06T13:31:10Z

Update: The llama4 test failed for me locally with:

pytest tests/llmcompressor/modeling/test_calib_llama4.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.10.12, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/rahul/llm-compressor
configfile: pyproject.toml
plugins: rerunfailures-15.1, mock-3.14.1
collected 1 item                                                                                                                       

tests/llmcompressor/modeling/test_calib_llama4.py F                                                                              [100%]

=============================================================== FAILURES ===============================================================
_________________________ test_calib_replace_llama4_moe_all_experts[meta-llama/Llama-4-Scout-17B-16E-Instruct] _________________________

model_stub = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'

    @pytest.mark.parametrize("model_stub", ["meta-llama/Llama-4-Scout-17B-16E-Instruct"])
    def test_calib_replace_llama4_moe_all_experts(model_stub):
        with skip_weights_download(Llama4ForConditionalGeneration):
            model = Llama4ForConditionalGeneration.from_pretrained(
                model_stub, torch_dtype="auto"
            )
    
        replace_modules_for_calibration(
            model, moe_calibrate_gated_acts=False, moe_calibrate_all_experts=True
        )
    
        # Find a Llama4 MoE layer
        moe_layer = None
>       for _, module in model.modules():
E       TypeError: cannot unpack non-iterable Llama4ForConditionalGeneration object

tests/llmcompressor/modeling/test_calib_llama4.py:25: TypeError
-------------------------------------------------------- Captured stdout setup ---------------------------------------------------------
2025-08-06T09:19:24.192277-0400 | reset | INFO - Compression lifecycle reset
--------------------------------------------------------- Captured stderr call ---------------------------------------------------------
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 21.27it/s]
The following layers were not sharded: vision_model.positional_embedding_vlm, vision_model.model.layers.*.self_attn.o_proj.weight, language_model.model.layers.*.input_layernorm.weight, language_model.model.layers.*.self_attn.q_proj.weight, vision_model.model.layers.*.mlp.fc*.bias, vision_model.model.layers.*.self_attn.v_proj.bias, vision_model.patch_embedding.linear.weight, vision_model.layernorm_pre.weight, language_model.model.layers.*.feed_forward.shared_expert.gate_proj.weight, vision_model.layernorm_pre.bias, vision_model.model.layers.*.self_attn.v_proj.weight, language_model.model.embed_tokens.weight, language_model.model.layers.*.feed_forward.shared_expert.up_proj.weight, vision_model.model.layers.*.self_attn.k_proj.weight, vision_model.model.layers.*.input_layernorm.weight, vision_model.model.layers.*.post_attention_layernorm.bias, vision_model.vision_adapter.mlp.fc*.weight, language_model.model.layers.*.feed_forward.shared_expert.down_proj.weight, vision_model.layernorm_post.bias, language_model.model.layers.*.feed_forward.router.weight, language_model.model.layers.*.feed_forward.experts.down_proj, vision_model.model.layers.*.self_attn.q_proj.bias, vision_model.model.layers.*.self_attn.o_proj.bias, vision_model.model.layers.*.input_layernorm.bias, vision_model.model.layers.*.self_attn.k_proj.bias, language_model.model.layers.*.post_attention_layernorm.weight, vision_model.model.layers.*.post_attention_layernorm.weight, vision_model.model.layers.*.mlp.fc*.weight, vision_model.layernorm_post.weight, vision_model.model.layers.*.self_attn.q_proj.weight, language_model.lm_head.weight, language_model.model.layers.*.self_attn.v_proj.weight, multi_modal_projector.linear_*.weight, language_model.model.layers.*.self_attn.o_proj.weight, vision_model.class_embedding, language_model.model.layers.*.self_attn.k_proj.weight, language_model.model.norm.weight, language_model.model.layers.*.feed_forward.experts.gate_up_proj
------------------------------------------------------- Captured stdout teardown -------------------------------------------------------
2025-08-06T09:23:08.745259-0400 | reset | INFO - Compression lifecycle reset
======================================================= short test summary info ========================================================
FAILED tests/llmcompressor/modeling/test_calib_llama4.py::test_calib_replace_llama4_moe_all_experts[meta-llama/Llama-4-Scout-17B-16E-Instruct] - TypeError: cannot unpack non-iterable Llama4ForConditionalGeneration object
==================================================== 1 failed in 234.66s (0:03:54) =====================================================

I will take a look at this shortly!
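
The traceback is consistent with how `torch.nn.Module` iteration works: `modules()` yields bare module objects, while `named_modules()` yields `(name, module)` pairs, so unpacking into `_, module` only works with the latter. The sketch below reproduces the failure mode with a dependency-free stand-in; `ToyModule` and `ToyMoE` are hypothetical classes that only mimic the shape of those two methods.

```python
from typing import Dict, Iterator, Tuple


class ToyModule:
    """Dependency-free stand-in for torch.nn.Module (illustration only)."""

    def __init__(self, **children: "ToyModule") -> None:
        self._children: Dict[str, "ToyModule"] = children

    def named_modules(self, prefix: str = "") -> Iterator[Tuple[str, "ToyModule"]]:
        # Yields (qualified_name, module) pairs, starting with the module itself.
        yield prefix, self
        for name, child in self._children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix)

    def modules(self) -> Iterator["ToyModule"]:
        # Yields modules only, with no names attached.
        for _, module in self.named_modules():
            yield module


class ToyMoE(ToyModule):
    pass


model = ToyModule(layer=ToyModule(feed_forward=ToyMoE()))

# Failing pattern from the test: unpacking each item of modules() as a pair.
unpack_failed = False
try:
    for _, module in model.modules():
        pass
except TypeError:
    unpack_failed = True

# Working pattern: named_modules() yields pairs that can be unpacked.
moe_name, moe_layer = None, None
for name, module in model.named_modules():
    if isinstance(module, ToyMoE):
        moe_name, moe_layer = name, module
        break
```

In the test itself, either iterating `model.named_modules()` or dropping the `_,` unpacking while keeping `model.modules()` would likely avoid this particular error.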

dsikka

How will the user configure these arguments?

Currently, for qwen3, we pass in calibrate_moe_context=True for nvfp4:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/qwen_30b_a3b.py#L70

This allows us to temporarily update the moe blocks with the blocks defined in modeling/qwen3_moe.py - is the plan to keep this argument?
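
The "temporarily update" behavior described here follows the usual swap-and-restore context-manager pattern. The sketch below shows that pattern with a plain dict standing in for a model's submodules; `temporarily_replace` and its arguments are hypothetical and are not the signature of `moe_calibration_context`.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def temporarily_replace(
    container: Dict[str, object],
    should_replace: Callable[[object], bool],
    make_replacement: Callable[[object], object],
) -> Iterator[None]:
    """Swap matching entries for the duration of the with-block, then restore them."""
    originals = {key: value for key, value in container.items() if should_replace(value)}
    for key, original in originals.items():
        container[key] = make_replacement(original)
    try:
        yield
    finally:
        # Restore the originals even if the body raised.
        container.update(originals)


blocks = {"attn": "Attention", "mlp": "SparseMoeBlock"}
with temporarily_replace(
    blocks,
    should_replace=lambda block: block == "SparseMoeBlock",
    make_replacement=lambda block: "Calibration" + block,
):
    inside = dict(blocks)  # calibration would run here, against the swapped blocks
after = dict(blocks)
```

The `try`/`finally` is the important design choice: the original blocks come back even when calibration fails partway through.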

dichn · 2025-08-07T08:05:25Z

As noted by @kylesayrs, this PR aligns with a spec change that drops moe_calibrate_gated_acts in favor of supporting only moe_calibrate_all_experts, simplifying the implementation.
Marking this PR as a draft again.
CC: @rahul-tuli @dsikka (Thank you for your review.)

gemini-code-assist bot reviewed Aug 1, 2025

dsikka requested review from kylesayrs and dsikka August 1, 2025 17:45

dichn force-pushed the calib branch from 61c1efc to 874e76f August 2, 2025 02:52

dsikka reviewed Aug 2, 2025

kylesayrs reviewed Aug 2, 2025

tests/llmcompressor/modeling/test_calib_llama4.py (outdated)

dichn force-pushed the calib branch from 874e76f to ac03be9 August 4, 2025 09:27

dichn requested review from kylesayrs and dsikka August 4, 2025 13:17

rahul-tuli reviewed Aug 6, 2025

dsikka reviewed Aug 6, 2025

dichn marked this pull request as draft August 7, 2025 08:08


