SmoothQuant failing on multi-gpus #1081
Describe the bug
I get a CUDA error when running SmoothQuant on multiple GPUs; I tried different CUDA versions without success. The error appears to come from a synchronization failure during the torch.cuda.empty_cache() call that runs after every forward pass in the calibration step. Moving the empty_cache() call to after all batches have been processed seems to fix the issue (see branch fix/smoothquant_multigpu).
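A minimal sketch of the change, not the actual llmcompressor code (the real loop lives in run_calibration_forward() in modifiers/utils/pytorch_helpers.py); free_cache stands in for torch.cuda.empty_cache() so the sketch runs without a GPU:

```python
from typing import Any, Callable, Iterable, List


def run_calibration(model: Callable[[Any], Any],
                    batches: Iterable[Any],
                    free_cache: Callable[[], None]) -> List[Any]:
    """Run forward passes over all calibration batches."""
    outputs = []
    for batch in batches:
        outputs.append(model(batch))
        # Current behavior: free_cache() was called HERE, once per batch.
        # On multi-GPU runs this raced with in-flight kernels and surfaced
        # as "RuntimeError: CUDA error: unspecified launch failure".
    # Proposed behavior: clear the cache once, after all batches are done.
    free_cache()
    return outputs
```

With torch available, free_cache would simply be torch.cuda.empty_cache; clearing once at the end keeps the memory benefit without a per-batch synchronization point.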
Expected behavior
SmoothQuant runs on multiple GPUs without errors.
Environment
Include all relevant environment information:
- OS: Ubuntu 20.04
- Python version: 3.10.12
- LLM Compressor version or commit hash: b175943
- ML framework version(s): torch 2.5.1
- Other Python package versions:
- Other relevant environment information:
- CUDA driver: 12.5
- GPU driver: 555.42.02
- Python CUDA libraries:
- nvidia-cublas-cu12==12.4.5.8
- nvidia-cuda-cupti-cu12==12.4.127
- nvidia-cuda-nvrtc-cu12==12.4.127
- nvidia-cuda-runtime-cu12==12.4.127
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.2.1.3
- nvidia-curand-cu12==10.3.5.147
- nvidia-cusolver-cu12==11.6.1.9
- nvidia-cusparse-cu12==12.3.1.170
- nvidia-nccl-cu12==2.21.5
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.4.127
To Reproduce
Exact steps to reproduce the behavior:
Run oneshot on Llama-3.3-70B-Instruct with 1024 samples of the LLM_compression_calibration dataset, with sequence length limited to 8192, on 4 A100 GPUs, using the following recipe:
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.0
      mappings:
        - [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"]
        - [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
        - [["re:.*down_proj"], "re:.*up_proj"]
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.0
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
            observer: "mse"
          input_activations:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "token"
            dynamic: true
            observer: "memoryless"
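For reference, a hypothetical standalone driver for the steps above (not the script from the log): it assumes llmcompressor at commit b175943, where oneshot() is importable from llmcompressor.transformers, and that the recipe above is saved as recipe.yaml; the model, dataset, and output_dir names are taken from or assumed for this report.

```python
import os


def reproduce(recipe_path: str = "recipe.yaml") -> None:
    """Run the failing oneshot job. Heavy: needs 4 A100s and the
    Llama-3.3-70B-Instruct weights, so the import is kept lazy."""
    from llmcompressor.transformers import oneshot

    oneshot(
        model="meta-llama/Llama-3.3-70B-Instruct",
        dataset="LLM_compression_calibration",   # dataset name from the report
        recipe=recipe_path,                      # the SmoothQuant+GPTQ recipe above
        max_seq_length=8192,
        num_calibration_samples=1024,
        output_dir="Llama-3.3-70B-Instruct-W8A8",  # assumed output path
    )


# Guarded so the file can be imported/inspected without launching the job.
if os.environ.get("RUN_REPRO"):
    reproduce()
```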
Errors
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3.10/code/queue_llmcompressor_oneshot.py", line 267, in
oneshot(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 84, in oneshot
main(model_args, data_args, training_args)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 413, in main
stage_runner.one_shot()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 163, in one_shot
self.trainer.one_shot(calibration_data=calib_data, stage=stage)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 440, in one_shot
apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
return active_session().apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 212, in apply
self.initialize(**kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 158, in initialize
mod_data = self._lifecycle.initialize(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
data = mod.initialize(state=self.state, **extras)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
modifier.initialize(state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
initialized = self.on_initialize(state=state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 136, in on_initialize
self._calibrate(state.model, calibration_dataloader)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 253, in _calibrate
run_calibration_forward(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 107, in run_calibration_forward
torch.cuda.empty_cache()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Additional context
Full log file attached:
task_1e407209f55046d399d387a0fd1858bb.log