Commit 5d037f7
refactor(smoothquant): address HDCharles review comments
- Move is_distributed() guard to _apply_smoothing() hotpath so
_reduce_activation_scales() only handles the distributed case,
improving readability for single-GPU readers
- Remove redundant unit tests (subsumed by 2n_calls test; empty
scales test unnecessary per reviewer feedback)
- Remove test_smoothquant_ddp_script_runs_cleanly (too expensive for CI)
- Switch integration test model to nm-testing/tinysmokellama-3.2
(CI-friendly tiny model per HDCharles suggestion)
- Switch DDP example model to Qwen/Qwen2-7B-Instruct (DDP more
meaningful for larger models)
- Fix --nproc arg conflict with torchrun, rename to --num_gpus
- Add benchmark_smoothquant_ddp.py for reproducing speedup numbers
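The guard move in the first bullet can be sketched roughly as follows. This is a minimal, library-free illustration and not the actual llmcompressor code: the function names mirror the commit message without the leading underscore, while the `distributed` flag, the `all_reduce_max` callable, and the choice of an elementwise maximum as the reduction are all assumptions made for the example.

```python
from typing import Callable, List, Optional

# Hypothetical type: takes this rank's per-channel values and returns the
# elementwise maximum across all ranks (what a MAX all-reduce would produce).
AllReduceMax = Callable[[List[float]], List[float]]


def reduce_activation_scales(
    local_max: List[float], all_reduce_max: AllReduceMax
) -> List[float]:
    """Distributed-only path: contains no single-GPU branch."""
    return all_reduce_max(local_max)


def apply_smoothing(
    local_max: List[float],
    distributed: bool,
    all_reduce_max: Optional[AllReduceMax] = None,
) -> List[float]:
    """The caller owns the guard, so single-GPU runs skip the reduction."""
    scales = local_max
    if distributed:
        if all_reduce_max is None:
            raise ValueError("a reducer is required when distributed=True")
        scales = reduce_activation_scales(local_max, all_reduce_max)
    return scales


# Simulate a second rank whose calibration samples produced other maxima.
other_rank = [3.0, 2.0]


def fake_reducer(values: List[float]) -> List[float]:
    # Stand-in for a collective: elementwise max against the other rank.
    return [max(a, b) for a, b in zip(values, other_rank)]


single_gpu = apply_smoothing([1.0, 4.0], distributed=False)
multi_gpu = apply_smoothing([1.0, 4.0], distributed=True, all_reduce_max=fake_reducer)
```

With the guard in the caller, a reader following the single-GPU path never has to step into the reduction helper at all.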
Distributed speedup on 4x V100 32GB (Qwen2-7B-Instruct, 512 samples):
1 GPU: 94.1 min | 8.93 GB peak mem | 1.00x
2 GPU: 58.7 min | 7.06 GB peak mem | 1.60x
4 GPU: 28.7 min | 7.06 GB peak mem | 3.28x
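The speedup column above follows from dividing the 1-GPU wall-clock time by each run's time. The short sketch below recomputes it from the reported minutes and also derives parallel efficiency (speedup divided by GPU count), which the commit message does not report; the helper name `scaling_summary` is made up for this example.

```python
from typing import Dict, Tuple

# Wall-clock minutes reported in the commit message, keyed by GPU count.
reported_minutes: Dict[int, float] = {1: 94.1, 2: 58.7, 4: 28.7}


def scaling_summary(minutes: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Return {gpu_count: (speedup, parallel_efficiency)} relative to 1 GPU."""
    baseline = minutes[1]
    summary = {}
    for gpus, elapsed in minutes.items():
        speedup = baseline / elapsed
        summary[gpus] = (speedup, speedup / gpus)
    return summary


summary = scaling_summary(reported_minutes)
for gpus in sorted(summary):
    speedup, efficiency = summary[gpus]
    print(f"{gpus} GPU: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

On the reported numbers this reproduces 1.00x, 1.60x, and 3.28x, which corresponds to roughly 80% and 82% parallel efficiency at 2 and 4 GPUs.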
Addresses review comments from HDCharles on PR vllm-project#2471
Signed-off-by: David Zheng <dqzheng1996@gmail.com>
1 parent fae564f commit 5d037f7
3 files changed (+75, -191 lines) in:
- examples/quantization_w8a8_int8
- src/llmcompressor/modifiers/transform/smoothquant
- tests/llmcompressor/modifiers/transform/smoothquant