
Commit ec07a83

[bug] fix block quantization example (#1777)
SUMMARY: Added dispatch for generation.

TEST PLAN:

```
python3 examples/quantization_w8a8_fp8/fp8_block_example.py
Loading checkpoint shards: 100%| 16/16 [02:18<00:00, 8.67s/it]
2025-08-26T12:27:52.974527+0000 | reset | INFO - Compression lifecycle reset
2025-08-26T12:27:53.024383+0000 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/26-08-2025_12.27.53.log
2025-08-26T12:27:53.024771+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-08-26T12:27:55.045066+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-08-26T12:27:55.045382+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Some parameters are on the meta device because they were offloaded to the cpu.
100%| 49975/49975 [00:00<00:00, 568185.21it/s]
Calibrating weights: 100%| 49975/49975 [06:36<00:00, 126.20it/s]
2025-08-26T12:35:34.595007+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-08-26T12:35:42.534632+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
========== SAMPLE GENERATION ==============
Some parameters are on the meta device because they were offloaded to the cpu.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Hello my name is Lillie and I'm a student in the 7th grade.
I have a math problem
==========================================
2025-08-26T12:36:53.305881+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
```

Signed-off-by: shanjiaz <[email protected]>
1 parent b26703f commit ec07a83

File tree

1 file changed: +2 −0 lines

examples/quantization_w8a8_fp8/fp8_block_example.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -2,6 +2,7 @@
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.utils import dispatch_for_generation
 
 MODEL_ID = "Qwen/Qwen3-30B-A3B"
 
@@ -26,6 +27,7 @@
 
 # Confirm generations of the quantized model look sane.
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=20)
 print(tokenizer.decode(output[0]))
```
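The log above hints at why the fix is needed: after `oneshot` compression of a large model like Qwen3-30B, "some parameters are on the meta device because they were offloaded to the cpu," so calling `model.generate` directly can fail or misbehave until the weights are dispatched onto the execution device. The following is a toy mock, not llmcompressor's actual implementation (the class and device strings are invented for illustration), sketching why the dispatch call must precede generation:

```python
# Toy mock illustrating the dispatch-before-generation pattern.
# ToyModel and its device strings are hypothetical stand-ins; the real
# fix calls llmcompressor.utils.dispatch_for_generation on an actual model.
class ToyModel:
    def __init__(self):
        # After compression, weights may still be offloaded (e.g. to CPU).
        self.device = "cpu_offload"

    def generate(self, prompt: str) -> str:
        if self.device != "cuda":
            # Mimics the failure mode when offloaded weights are used directly.
            raise RuntimeError("weights not dispatched to the execution device")
        return prompt + " ... generated text"


def dispatch_for_generation(model: ToyModel) -> ToyModel:
    # Hypothetical stand-in: move offloaded weights to the generation device.
    model.device = "cuda"
    return model


model = ToyModel()
dispatch_for_generation(model)  # without this call, generate() raises
print(model.generate("Hello my name is"))
```

The two-line diff above applies exactly this pattern: import the helper, then call it once on the model right before the sample-generation block.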
