
Commit db0dccd

rahul-tuli and dsikka authored
Fix: Improve SmoothQuant Support for Mixture of Experts (MoE) Models (#1455)
The current `SmoothQuant` implementation produces poor-quality outputs when applied to **Mixture of Experts (MoE)** models. Specifically:

* The `post_attention_layernorm` is downscaled.
* The corresponding `gate` layer is upscaled.
* However, the `experts`, which also receive inputs from `post_attention_layernorm`, are **not** upscaled, leading to a mismatch in input scales and degraded performance.

### Solution

This PR removes the **gate layer** from `SmoothQuant`'s list of balanced layers.

Rationale:

* Gate layers are not quantization targets, so they do **not** require smoothing.

### Evaluation

**Command used:**

```bash
CUDA_VISIBLE_DEVICES=1,4,5 lm_eval \
  --model hf \
  --model_args "pretrained=<MODEL_PATH>,dtype=auto,add_bos_token=True,trust_remote_code=True,parallelize=True" \
  --tasks winogrande \
  --batch_size auto \
  --num_fewshot 5 \
  --write_out \
  --output_path <OUTPUT_DIR> \
  --show_config
```

**Raw results:**

```
|   Tasks  |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.8248|±  |0.0107|  # Quantized
|winogrande|      1|none  |     5|acc   |↑  |0.8264|±  |0.0106|  # Base
```

**Summary:**

| Model     | Accuracy | StdErr | Few-shot | Task       |
|-----------|----------|--------|----------|------------|
| Quantized | 82.48%   | ±1.07% | 5        | winogrande |
| Base      | 82.64%   | ±1.06% | 5        | winogrande |

**Sample Generation Before This PR:**

```bash
========== SAMPLE GENERATION ==============
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
['<|begin▁of▁sentence|>I love quantization because.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\.}~\\']
==========================================
```

**Sample Generation After This PR:**

```bash
========== SAMPLE GENERATION ==============
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
['<|begin▁of▁sentence|>I love quantization because it’s a way to make a continuous function discrete.\n\nIn this post, I will explain how to quantize a continuous function using a simple method called the midpoint method. This method is a type of quantization technique']
==========================================
```

**Produced Model:** [HuggingFace: Mixtral-8x7B-Instruct-v0.1-W8A8-updated-smoothquant](https://huggingface.co/nm-testing/Mixtral-8x7B-Instruct-v0.1-W8A8-updated-smoothquant/tree/main/) (an illustrative quantization recipe is sketched below, after this message)

---

### Future Work

* Extend `SmoothQuant` to support **scaling of expert layers** for even better MoE compatibility (a hypothetical expert-scaling mapping is sketched after the diff below).

---

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
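For context, a W8A8 checkpoint like the one linked above is typically produced with llm-compressor's one-shot flow. The recipe below is only a hedged sketch and is not taken from this PR: the base model, calibration dataset, sequence length, sample count, and `GPTQModifier` settings are illustrative assumptions, and the `oneshot` import path varies across llmcompressor versions.

```python
# Hedged sketch (assumed names, not from this PR): producing a W8A8 checkpoint
# with SmoothQuant + GPTQ via llm-compressor's one-shot flow.
from llmcompressor import oneshot  # older releases expose llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative base model

recipe = [
    # smoothing_strength is the SmoothQuant alpha; 0.8 is a commonly used value.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize Linear layers to int8 weights and activations; keep lm_head and
    # the MoE router gates unquantized (gates are not quantization targets).
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*block_sparse_moe.gate"],
    ),
]

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",      # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration settings
    num_calibration_samples=512,
)
```

Leaving the router gates out of quantization mirrors the rationale above that gate layers are not quantization targets; with this PR, the default SmoothQuant mappings also stop balancing them.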
1 parent a6567d7 commit db0dccd

File tree: 1 file changed (+0 −6 lines)

  • src/llmcompressor/modifiers/smoothquant/utils.py

src/llmcompressor/modifiers/smoothquant/utils.py

Lines changed: 0 additions & 6 deletions
@@ -28,9 +28,6 @@
         balance_layers=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
         smooth_layers="re:.*input_layernorm",
     ),
-    LayerMap(
-        balance_layers=["re:.*gate"], smooth_layers="re:.*post_attention_layernorm"
-    ),
 ]
 BLOOM_SMOOTHQUANT_MAPPINGS: List[LayerMap] = [
     LayerMap(
@@ -68,9 +65,6 @@
         balance_layers=["re:.*q_proj", "re:.*kv_a_proj_with_mqa"],
         smooth_layers="re:.*input_layernorm",
     ),
-    LayerMap(
-        balance_layers=["re:.*gate"], smooth_layers="re:.*post_attention_layernorm"
-    ),
 ]
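The Future Work item in the commit message (scaling the expert layers) could reuse the same `LayerMap` structure this diff edits. The mapping below is a hypothetical sketch, not part of this change: the `w1`/`w3` regexes assume Mixtral-style expert module names, and the import path simply mirrors the file touched here.

```python
# Hypothetical future mapping (not part of this PR): balance every consumer of
# post_attention_layernorm's output (expert input projections and the router
# gate) so that downscaling the layernorm does not leave any layer with a
# mismatched input scale.
from llmcompressor.modifiers.smoothquant.utils import LayerMap

MOE_EXPERT_LAYER_MAP = LayerMap(
    balance_layers=[
        "re:.*experts.*w1",  # expert input projections (Mixtral naming assumed)
        "re:.*experts.*w3",
        "re:.*gate",         # router gate, rescaled only to keep input scales consistent
    ],
    smooth_layers="re:.*post_attention_layernorm",
)
```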