[Fix] SmoothQuant MoE Support: Smooth All Experts, Not Just expert.0 #2084
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Add comprehensive unit tests to verify that SmoothQuant correctly handles Mixture of Experts (MoE) models by smoothing all experts, not just the first one.

Tests added:
- test_moe_all_experts_smoothed: Verifies all 8 experts in a single MoE layer are included in balance_layers
- test_moe_multiple_layers_all_experts_smoothed: Verifies correct behavior across multiple transformer layers with 4 experts each

These tests currently fail with the existing implementation, which only matches the first expert due to get_matching_layer() returning a single match instead of iterating over all matches.

Signed-off-by: Rahul-Tuli <[email protected]>
Replace get_matching_layer() with match_named_modules() to iterate over ALL matched layers instead of returning only the first match. This fixes a critical bug where only expert.0 was smoothed in MoE models, leaving all other experts unsmoothed and causing severe accuracy degradation.

Changes:
- Use match_named_modules from compressed_tensors.utils to iterate over all matching modules
- Search for balance layers within the parent module scope for better locality
- Follow the same pattern already proven to work in AWQModifier

This fix ensures all experts in MoE models (Mixtral, Qwen3, Phi, DeepSeek) are properly smoothed during quantization.

Signed-off-by: Rahul-Tuli <[email protected]>
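A minimal sketch of the resolution pattern this commit describes, assuming match_named_modules(module, targets, ignore) from compressed_tensors.utils yields (name, submodule) pairs for every match; the helper name collect_balance_layers is hypothetical, and the quoted diff in the review below shows the actual change:

```python
# Hedged sketch, not the actual implementation: gather every module under the
# smooth layer's parent that matches any balance pattern, instead of only the
# first regex hit.
from compressed_tensors.utils import match_named_modules

def collect_balance_layers(smooth_parent, to_balance, ignore):
    balance_layers = []
    # match_named_modules iterates over all targets, so a single loop covers
    # experts.0 through experts.N rather than stopping at the first match.
    for _, balance_layer in match_named_modules(smooth_parent, to_balance, ignore):
        balance_layers.append(balance_layer)
    return balance_layers
```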
fynnsu left a comment
Looks good!
```python
for balance_regex in to_balance:
    for _, balance_layer in match_named_modules(
        smooth_parent, [balance_regex], self.ignore
    ):
        balance_layers.append(balance_layer)
```
I think we can just do

```python
for _, balance_layer in match_named_modules(
    smooth_parent, to_balance, self.ignore
):
    balance_layers.append(balance_layer)
```

since match_named_modules already contains a loop over targets (link).
or maybe even

```python
balance_layers = [balance_layer for _, balance_layer in match_named_modules(smooth_parent, to_balance, self.ignore)]
```

```python
smooth_parent = (
    get_layer_by_name(smooth_parent_name, model)
    if smooth_parent_name
    else model
)
```
This is totally fine as is but it might be cleaner if we updated get_layer_by_name so that get_layer_by_name("", model) returns model.
This could be done by adding an if statement before this line to handle the special case.
That would allow us to remove the if smooth_parent_name else model logic.
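A hedged sketch of that suggestion; the real get_layer_by_name lives in llm-compressor and may resolve names differently, so this hypothetical version only illustrates the proposed empty-name special case:

```python
import torch.nn as nn

def get_layer_by_name(layer_name: str, model: nn.Module) -> nn.Module:
    # Proposed special case: treat an empty name as referring to the model itself,
    # so call sites no longer need the `... if smooth_parent_name else model` guard.
    if not layer_name:
        return model
    # Dotted-path lookup as a stand-in for the real helper's resolution logic.
    return model.get_submodule(layer_name)
```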
HDCharles left a comment
seems good to me
```python
# one layer to smooth
mapping = SmoothQuantMapping(
    layer_name, smooth_layer, balance_layers
)
to_smooth_list = [to_smooth] if isinstance(to_smooth, str) else to_smooth
```
I'm working on a matching util that will make this mapping-matching easier, FYI.
Problem

SmoothQuant only smoothed the first expert (expert.0) in Mixture of Experts (MoE) models, leaving all other experts unsmoothed. This caused severe accuracy degradation for MoE models.

Root Cause
The _resolve_mappings() method in SmoothQuantModifier used get_matching_layer(), which returns only the first regex match instead of iterating over all matches. For MoE models with regex patterns like "re:.*experts.*w1", this meant only expert.0.w1 was smoothed while experts 1-N were ignored.
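A toy illustration of the failure mode, using a hypothetical module layout rather than a real MoE architecture: a first-match lookup resolves only experts.0.w1, while iterating over every match covers all experts.

```python
import re
import torch.nn as nn

# Hypothetical MoE block: 8 experts, each with a w1 projection.
experts = nn.ModuleList([nn.ModuleDict({"w1": nn.Linear(16, 32)}) for _ in range(8)])
model = nn.ModuleDict({"experts": experts})

pattern = re.compile(r".*experts.*w1")  # mirrors the "re:.*experts.*w1" mapping
matches = [name for name, _ in model.named_modules() if pattern.fullmatch(name)]

first_only = matches[:1]              # first-match behavior: only expert 0
assert first_only == ["experts.0.w1"]
assert len(matches) == 8              # iterating all matches covers experts 0-7
```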
Solution

Replace get_matching_layer() with match_named_modules() from compressed_tensors.utils to iterate over ALL matched layers. This follows the same proven pattern used in AWQModifier.
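A hedged usage sketch under assumptions: the ([balance_patterns], smooth_pattern) mapping format accepted by SmoothQuantModifier, and Mixtral-style names (experts.*.w1/w3, post_attention_layernorm) that are illustrative rather than taken from this PR.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Illustrative MoE mapping: smooth the post-attention layernorm against every
# expert's input projections. With this fix, the "re:.*experts.*" patterns
# resolve to all experts, not just experts.0.
modifier = SmoothQuantModifier(
    smoothing_strength=0.8,
    mappings=[
        (["re:.*experts.*w1", "re:.*experts.*w3"], "re:.*post_attention_layernorm"),
    ],
)
```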
Key Changes

- Use match_named_modules from compressed_tensors.utils
- _resolve_mappings(): Iterate over all matched layers instead of just the first

Tests Added
Added unit tests covering the issue to verify MoE support; these tests fail on main but pass with the current diff:

1. test_moe_all_experts_smoothed: Verifies all 8 experts in a single MoE layer are included in balance_layers (see the sketch below).
2. test_moe_multiple_layers_all_experts_smoothed: Verifies correct behavior across multiple transformer layers.
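A hedged sketch of the kind of check these tests perform; the toy module, dimensions, and exact call pattern are assumptions, and the real tests live in tests/llmcompressor/modifiers/smoothquant/test_base.py.

```python
import torch.nn as nn
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

class ToyMoELayer(nn.Module):
    """Hypothetical layer: one smoothable layernorm and 8 experts with w1 projections."""
    def __init__(self, num_experts: int = 8):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(16)
        self.experts = nn.ModuleList(
            [nn.ModuleDict({"w1": nn.Linear(16, 32)}) for _ in range(num_experts)]
        )

model = nn.ModuleDict({"layer": ToyMoELayer()})
modifier = SmoothQuantModifier(
    mappings=[(["re:.*experts.*w1"], "re:.*input_layernorm")],
)

# Assumes _resolve_mappings returns the resolved SmoothQuantMapping objects,
# each exposing balance_layers (as shown in the diff excerpt above).
resolved = modifier._resolve_mappings(model)
balance_layers = [layer for m in resolved for layer in m.balance_layers]
assert len(balance_layers) == 8  # all experts, not just experts.0
```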
Test Results
All tests pass successfully:
```
$ python -m pytest tests/llmcompressor/modifiers/smoothquant/test_base.py -v
test_smooth_quant_is_registered ✅ PASSED
test_smooth_quant_defaults ✅ PASSED
test_override_defaults ✅ PASSED
test_moe_all_experts_smoothed ✅ PASSED
test_moe_multiple_layers_all_experts_smoothed ✅ PASSED
========================= 5 passed in 0.41s =========================
```

Before Fix (Tests Failed)
After Fix (Tests Pass)
Related Issues
Fixes the SmoothQuant MoE bug reported in the community discussion about MoE quantization support.