
Commit db0dccd

rahul-tuli and dsikka authored
Fix: Improve SmoothQuant Support for Mixture of Experts (MoE) Models (#1455)
The current `SmoothQuant` implementation produces poor-quality outputs when applied to **Mixture of Experts (MoE)** models. Specifically:

* The `post_attention_layernorm` is downscaled.
* The corresponding `gate` layer is upscaled.
* However, the `experts`, which also receive inputs from `post_attention_layernorm`, are **not** upscaled, leading to a mismatch in input scales and degraded performance.

### Solution

This PR removes the **gate layer** from `SmoothQuant`'s list of balanced layers.

Rationale:

* Gate layers are not quantization targets, so they do **not** require smoothing.

### Evaluation

**Command used:**

```bash
CUDA_VISIBLE_DEVICES=1,4,5 lm_eval \
  --model hf \
  --model_args "pretrained=<MODEL_PATH>,dtype=auto,add_bos_token=True,trust_remote_code=True,parallelize=True" \
  --tasks winogrande \
  --batch_size auto \
  --num_fewshot 5 \
  --write_out \
  --output_path <OUTPUT_DIR> \
  --show_config
```

**Raw results:**

```
|   Tasks  |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.8248|±  |0.0107|  # Quantized
|winogrande|      1|none  |     5|acc   |↑  |0.8264|±  |0.0106|  # Base
```

**Summary:**

| Model     | Accuracy | StdErr | Few-shot | Task       |
|-----------|----------|--------|----------|------------|
| Quantized | 82.48%   | ±1.07% | 5        | winogrande |
| Base      | 82.64%   | ±1.06% | 5        | winogrande |

**Sample Generation Before This PR:**

```bash
========== SAMPLE GENERATION ==============
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
['<|begin▁of▁sentence|>I love quantization because.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\']
==========================================
```

**Sample Generation After This PR:**

```bash
========== SAMPLE GENERATION ==============
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
['<|begin▁of▁sentence|>I love quantization because it’s a way to make a continuous function discrete.\n\nIn this post, I will explain how to quantize a continuous function using a simple method called the midpoint method. This method is a type of quantization technique']
==========================================
```

**Produced Model:** [HuggingFace: Mixtral-8x7B-Instruct-v0.1-W8A8-updated-smoothquant](https://huggingface.co/nm-testing/Mixtral-8x7B-Instruct-v0.1-W8A8-updated-smoothquant/tree/main/) (an illustrative quantization recipe is sketched below, after this message)

---

### Future Work

* Extend `SmoothQuant` to support **scaling of expert layers** for even better MoE compatibility (a hypothetical expert-scaling mapping is sketched after the diff below).

---

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
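For context, a W8A8 checkpoint like the one linked above is typically produced with llm-compressor's one-shot flow. The recipe below is only a hedged sketch and is not taken from this PR: the base model, calibration dataset, sequence length, sample count, and `GPTQModifier` settings are illustrative assumptions, and the `oneshot` import path varies across llmcompressor versions.

```python
# Hedged sketch (assumed names, not from this PR): producing a W8A8 checkpoint
# with SmoothQuant + GPTQ via llm-compressor's one-shot flow.
from llmcompressor import oneshot  # older releases expose llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative base model

recipe = [
    # smoothing_strength is the SmoothQuant alpha; 0.8 is a commonly used value.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize Linear layers to int8 weights and activations; keep lm_head and
    # the MoE router gates unquantized (gates are not quantization targets).
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*block_sparse_moe.gate"],
    ),
]

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",      # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration settings
    num_calibration_samples=512,
)
```

Leaving the router gates out of quantization mirrors the rationale above that gate layers are not quantization targets; with this PR, the default SmoothQuant mappings also stop balancing them.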
1 parent a6567d7 commit db0dccd

File tree: 1 file changed (+0 −6 lines)

  • src/llmcompressor/modifiers/smoothquant/utils.py

src/llmcompressor/modifiers/smoothquant/utils.py

Lines changed: 0 additions & 6 deletions
@@ -28,9 +28,6 @@
         balance_layers=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
         smooth_layers="re:.*input_layernorm",
     ),
-    LayerMap(
-        balance_layers=["re:.*gate"], smooth_layers="re:.*post_attention_layernorm"
-    ),
 ]
 BLOOM_SMOOTHQUANT_MAPPINGS: List[LayerMap] = [
     LayerMap(
@@ -68,9 +65,6 @@
         balance_layers=["re:.*q_proj", "re:.*kv_a_proj_with_mqa"],
         smooth_layers="re:.*input_layernorm",
     ),
-    LayerMap(
-        balance_layers=["re:.*gate"], smooth_layers="re:.*post_attention_layernorm"
-    ),
 ]
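The Future Work item in the commit message (scaling the expert layers) could reuse the same `LayerMap` structure this diff edits. The mapping below is a hypothetical sketch, not part of this change: the `w1`/`w3` regexes assume Mixtral-style expert module names, and the import path simply mirrors the file touched here.

```python
# Hypothetical future mapping (not part of this PR): balance every consumer of
# post_attention_layernorm's output (expert input projections and the router
# gate) so that downscaling the layernorm does not leave any layer with a
# mismatched input scale.
from llmcompressor.modifiers.smoothquant.utils import LayerMap

MOE_EXPERT_LAYER_MAP = LayerMap(
    balance_layers=[
        "re:.*experts.*w1",  # expert input projections (Mixtral naming assumed)
        "re:.*experts.*w3",
        "re:.*gate",         # router gate, rescaled only to keep input scales consistent
    ],
    smooth_layers="re:.*post_attention_layernorm",
)
```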