
Commit ec07a83

[bug] fix block quantization example (#1777)
SUMMARY: Added dispatch for generation.

TEST PLAN:

```
python3 examples/quantization_w8a8_fp8/fp8_block_example.py
Loading checkpoint shards: 100%| 16/16 [02:18<00:00, 8.67s/it]
2025-08-26T12:27:52.974527+0000 | reset | INFO - Compression lifecycle reset
2025-08-26T12:27:53.024383+0000 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/26-08-2025_12.27.53.log
2025-08-26T12:27:53.024771+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-08-26T12:27:55.045066+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-08-26T12:27:55.045382+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Some parameters are on the meta device because they were offloaded to the cpu.
100%| 49975/49975 [00:00<00:00, 568185.21it/s]
Calibrating weights: 100%| 49975/49975 [06:36<00:00, 126.20it/s]
2025-08-26T12:35:34.595007+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-08-26T12:35:42.534632+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
========== SAMPLE GENERATION ==============
Some parameters are on the meta device because they were offloaded to the cpu.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Hello my name is Lillie and I'm a student in the 7th grade.
I have a math problem
==========================================
2025-08-26T12:36:53.305881+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
```

Signed-off-by: shanjiaz <[email protected]>
1 parent b26703f commit ec07a83

File tree

1 file changed: +2 −0 lines

examples/quantization_w8a8_fp8/fp8_block_example.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -2,6 +2,7 @@
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.utils import dispatch_for_generation
 
 MODEL_ID = "Qwen/Qwen3-30B-A3B"
 
@@ -26,6 +27,7 @@
 
 # Confirm generations of the quantized model look sane.
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=20)
 print(tokenizer.decode(output[0]))
```
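The log above hints at why the fix is needed: after `oneshot` compression of a large model like Qwen3-30B, "some parameters are on the meta device because they were offloaded to the cpu," so calling `model.generate` directly can fail or misbehave until the weights are dispatched onto the execution device. The following is a toy mock, not llmcompressor's actual implementation (the class and device strings are invented for illustration), sketching why the dispatch call must precede generation:

```python
# Toy mock illustrating the dispatch-before-generation pattern.
# ToyModel and its device strings are hypothetical stand-ins; the real
# fix calls llmcompressor.utils.dispatch_for_generation on an actual model.
class ToyModel:
    def __init__(self):
        # After compression, weights may still be offloaded (e.g. to CPU).
        self.device = "cpu_offload"

    def generate(self, prompt: str) -> str:
        if self.device != "cuda":
            # Mimics the failure mode when offloaded weights are used directly.
            raise RuntimeError("weights not dispatched to the execution device")
        return prompt + " ... generated text"


def dispatch_for_generation(model: ToyModel) -> ToyModel:
    # Hypothetical stand-in: move offloaded weights to the generation device.
    model.device = "cuda"
    return model


model = ToyModel()
dispatch_for_generation(model)  # without this call, generate() raises
print(model.generate("Hello my name is"))
```

The two-line diff above applies exactly this pattern: import the helper, then call it once on the model right before the sample-generation block.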
