SmoothQuant failing on multi-gpus #1081

@anmarques

Description

Describe the bug
I get a CUDA error when running SmoothQuant on multiple GPUs. Tried on different CUDA versions without success. The error seems to come from a synchronization failure during an empty_cuda() call that is used after every forward pass in the calibration step. Moving empty_cuda() to after all batches are processed seems to fix the issue (see branch fix/smoothquant_multigpu)

Expected behavior
Run SmoothQuant on multiple GPUs without errors.

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04
  2. Python version: 3.10.12
  3. LLM Compressor version or commit hash: b175943
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions:
  6. Other relevant environment information:
  • CUDA driver: 12.5
  • GPU driver: 555.42.02
  • Python CUDA libraries:
    • nvidia-cublas-cu12==12.4.5.8
    • nvidia-cuda-cupti-cu12==12.4.127
    • nvidia-cuda-nvrtc-cu12==12.4.127
    • nvidia-cuda-runtime-cu12==12.4.127
    • nvidia-cudnn-cu12==9.1.0.70
    • nvidia-cufft-cu12==11.2.1.3
    • nvidia-curand-cu12==10.3.5.147
    • nvidia-cusolver-cu12==11.6.1.9
    • nvidia-cusparse-cu12==12.3.1.170
    • nvidia-nccl-cu12==2.21.5
    • nvidia-nvjitlink-cu12==12.4.127
    • nvidia-nvtx-cu12==12.4.127

To Reproduce
Exact steps to reproduce the behavior:
Run oneshot on Llama-3.3-70B-Instruct with 1024 samples of the LLM_compression_calibration dataset, with sequence length limited to 8192, on 4 A100 GPUs, using the following recipe:

quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.0
      mappings:
        - [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"]
        - [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
        - [["re:.*down_proj"], "re:.*up_proj"]
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.0
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
            observer: "mse"
          input_activations:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "token"
            dynamic: true
            observer: "memoryless"

Errors
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3.10/code/queue_llmcompressor_oneshot.py", line 267, in
oneshot(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 84, in oneshot
main(model_args, data_args, training_args)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 413, in main
stage_runner.one_shot()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 163, in one_shot
self.trainer.one_shot(calibration_data=calib_data, stage=stage)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 440, in one_shot
apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
return active_session().apply(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 212, in apply
self.initialize(**kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 158, in initialize
mod_data = self._lifecycle.initialize(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
data = mod.initialize(state=self.state, **extras)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
modifier.initialize(state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
initialized = self.on_initialize(state=state, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 136, in on_initialize
self._calibrate(state.model, calibration_dataloader)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 253, in _calibrate
run_calibration_forward(
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 107, in run_calibration_forward
torch.cuda.empty_cache()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Additional context
Full log file attached:

task_1e407209f55046d399d387a0fd1858bb.log
