
Commit 7a82b9f

shanjiaz authored and dbarbuzzi committed
pass in tensor_id for calculate_qparam (#1709)
### Issue

FP8_BLOCK quantization produced poor `lm_eval` results due to two issues:

1. **Shared statistics across blocks**: All blocks used the same `tensor_id`, causing incorrect running statistics.
2. **MoE gates being quantized**: Critical routing layers were quantized, degrading performance.

### Solution

- **Fixed block statistics**: Pass a unique `tensor_id=f"block_{i}_{j}"` to `calculate_qparams` for each block.
- **Updated example**: Set proper ignore layers.

### Changes

- `src/llmcompressor/observers/base.py`: Added unique tensor IDs for block-wise statistics.
- `examples/quantization_w8a8_fp8/fp8_block_example.py`: Fixed ignore patterns for MoE gates.

### Test

Produced models:

```
shanjiaz/Qwen3-30B-A3B-FP8-BLOCK
shanjiaz/Qwen3-0.6B-FP8-BLOCK
```

The quantized models now get exactly the same results as Michael's originals.

```
lm_eval --model vllm --model_args pretrained=shanjiaz/Qwen3-30B-A3B-FP8-BLOCK --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=shanjiaz/Qwen3-30B-A3B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
```

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8324|± |0.0103|
| | |strict-match | 5|exact_match|↑ |0.8848|± |0.0088|

```
lm_eval --model vllm --model_args pretrained=shanjiaz/Qwen3-0.6B-FP8-BLOCK --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=shanjiaz/Qwen3-0.6B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
```

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3995|± |0.0135|

---------

Signed-off-by: shanjiaz <[email protected]>
Signed-off-by: Domenic Barbuzzi <[email protected]>
1 parent dbaff79 commit 7a82b9f
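For context on the first issue, here is a minimal, hypothetical sketch of a moving-average min/max observer keyed by `tensor_id` (it is not llm-compressor's actual observer code). With one shared id, every block's range gets averaged into the same running statistics, so no block ends up with a scale matched to its own values; unique per-block ids keep the statistics separate.

```python
# Hypothetical sketch -- not llm-compressor's observer implementation.
# Illustrates why sharing one tensor_id across blocks corrupts running stats.
import torch


class ToyMinMaxObserver:
    def __init__(self, averaging_constant: float = 0.1):
        self.averaging_constant = averaging_constant
        self.min_vals = {}  # running min per tensor_id
        self.max_vals = {}  # running max per tensor_id

    def observe(self, observed: torch.Tensor, tensor_id: str = "default"):
        mn, mx = observed.min(), observed.max()
        if tensor_id not in self.min_vals:
            self.min_vals[tensor_id], self.max_vals[tensor_id] = mn, mx
        else:
            # moving-average update blends the new range into the running one
            c = self.averaging_constant
            self.min_vals[tensor_id] = (1 - c) * self.min_vals[tensor_id] + c * mn
            self.max_vals[tensor_id] = (1 - c) * self.max_vals[tensor_id] + c * mx
        return self.min_vals[tensor_id], self.max_vals[tensor_id]


obs = ToyMinMaxObserver()
narrow_block = 0.01 * torch.randn(128, 128)  # small-magnitude block
wide_block = 10.0 * torch.randn(128, 128)    # large-magnitude block

# Shared id (the bug): the two blocks' ranges are blended together,
# so neither block gets a range matched to its own values.
obs.observe(narrow_block, tensor_id="shared")
print(obs.observe(wide_block, tensor_id="shared"))

# Unique ids (the fix): each block keeps its own running statistics.
print(obs.observe(narrow_block, tensor_id="block_0_0"))
print(obs.observe(wide_block, tensor_id="block_0_1"))
```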

File tree

2 files changed: +13 -3 lines changed

examples/quantization_w8a8_fp8/fp8_block_example.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -3,7 +3,7 @@
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
 
-MODEL_ID = "Qwen/Qwen3-0.6B"
+MODEL_ID = "Qwen/Qwen3-30B-A3B"
 
 # Load model.
 model = AutoModelForCausalLM.from_pretrained(
@@ -16,7 +16,7 @@
 # * quantize the weights to fp8 with per channel via ptq
 # * quantize the activations to fp8 with dynamic per token
 recipe = QuantizationModifier(
-    targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"]
+    targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head", "re:.*mlp.gate$"],
 )
 
 # Apply quantization.
```
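As a quick sanity check on the new ignore entry (llm-compressor treats `re:`-prefixed ignore entries as regular expressions over module names), the snippet below runs the pattern against a few module names in the Qwen3-MoE naming style; the names are illustrative, not taken from the PR:

```python
# Sanity check of the ignore regex against illustrative Qwen3-MoE module names.
import re

pattern = r".*mlp.gate$"  # the part after the "re:" prefix in the recipe

names = [
    "model.layers.0.mlp.gate",                 # MoE router gate
    "model.layers.0.mlp.experts.7.gate_proj",  # expert projection
    "model.layers.0.mlp.experts.7.up_proj",    # expert projection
]
for name in names:
    matched = re.match(pattern, name) is not None
    print(f"{name}: {'ignored' if matched else 'quantized'}")
# model.layers.0.mlp.gate: ignored
# model.layers.0.mlp.experts.7.gate_proj: quantized
# model.layers.0.mlp.experts.7.up_proj: quantized
```

Only the per-layer router gate matches; the experts' `gate_proj`/`up_proj` weights remain quantized, which is the intent of the fix.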

src/llmcompressor/observers/base.py

Lines changed: 11 additions & 1 deletion
```diff
@@ -63,12 +63,18 @@ def calculate_qparams(
         self,
         observed: Tensor,
         reduce_dims: Optional[Tuple[int]] = None,
+        tensor_id: Optional[Any] = None,
+        global_scale: Optional[Tensor] = None,
     ) -> Tuple[FloatTensor, IntTensor]:
         """
         :param observed: observed tensor to calculate quantization parameters for
         :param reduce_dims: optional tuple of dimensions to reduce along,
             returned scale and zero point will be shaped (1,) along the
             reduced dimensions
+        :param tensor_id: optional id for tracking separate statistics when different
+            ranges of observed tensors are passed, useful for sharding tensors by
+            group_size or block quantization
+        :param global_scale: optional scale to further scale local quantization scales
         :return: tuple of scale and zero point derived from the observed tensor
         """
         raise NotImplementedError(f"{self.__class__} must implement calculate_qparams")
@@ -233,8 +239,12 @@ def get_qparams(
                     c0 = j * block_cols
                     c1 = min((j + 1) * block_cols, cols)
                     # reduce across both dims to get one scale and zp per block
+                    # Use unique tensor_id for each block to maintain separate stats
+                    block_tensor_id = f"block_{i}_{j}"
                     scale_bp, zp_bp = self.calculate_qparams(
-                        observed[r0:r1, c0:c1], reduce_dims=(0, 1)
+                        observed[r0:r1, c0:c1],
+                        reduce_dims=(0, 1),
+                        tensor_id=block_tensor_id,
                     )
                     self._scale[i, j] = scale_bp
                     self._zero_point[i, j] = zp_bp
```
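To make the block loop above concrete, here is a standalone sketch that mirrors the row/column slicing in `get_qparams` and produces one symmetric scale per block. It assumes 128×128 blocks and the float8_e4m3fn maximum of 448.0; the helper name `blockwise_fp8_scales` is made up for illustration and is not library code.

```python
# Standalone sketch mirroring the block slicing above; not library code.
import torch

FP8_E4M3_MAX = 448.0  # largest representable float8_e4m3fn value


def blockwise_fp8_scales(weight: torch.Tensor,
                         block_rows: int = 128,
                         block_cols: int = 128) -> torch.Tensor:
    rows, cols = weight.shape
    n_r = (rows + block_rows - 1) // block_rows  # ceil-divide into row blocks
    n_c = (cols + block_cols - 1) // block_cols  # ceil-divide into col blocks
    scales = torch.empty(n_r, n_c)
    for i in range(n_r):
        for j in range(n_c):
            r0, r1 = i * block_rows, min((i + 1) * block_rows, rows)
            c0, c1 = j * block_cols, min((j + 1) * block_cols, cols)
            # symmetric scale: map the block's max magnitude onto the fp8 range
            scales[i, j] = weight[r0:r1, c0:c1].abs().max() / FP8_E4M3_MAX
    return scales


w = torch.randn(256, 384)
print(blockwise_fp8_scales(w).shape)  # torch.Size([2, 3]) -- one scale per block
```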
