
Host memory OOM Killed when quantizing DeepSeek-V3 AWQ #1684

@ashgold

Description


Describe the bug
Hello. When I ran AWQ quantization of DeepSeek-V3, the process was killed by a host-memory OOM during smoothing of the 48th layer.
(1x H100, 72 CPU cores, 1 TiB host memory)
Is there any way to reduce host-memory usage during the AWQ quantization process?

I performed quantization based on the following PR.
#1619
@cjackal (I apologize if this mention surprised you.)

Expected behavior
AWQ quantization completes all layers without the process being killed by the host OOM killer.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Ubuntu 22.04
  2. Python version [e.g. 3.7]: 3.12
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.6.1.dev42+g92cdf630
  4. ML framework version(s) [e.g. torch 2.3.1]: torch 2.7.1
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
    accelerate 1.9.0
    aiohappyeyeballs 2.6.1
    aiohttp 3.12.14
    aiosignal 1.4.0
    annotated-types 0.7.0
    attrs 25.3.0
    certifi 2025.7.14
    charset-normalizer 3.4.2
    compressed-tensors 0.10.3a20250724
    datasets 4.0.0
    dill 0.3.8
    filelock 3.18.0
    frozendict 2.4.6
    frozenlist 1.7.0
    fsspec 2025.3.0
    hf-xet 1.1.5
    huggingface-hub 0.34.1
    idna 3.10
    Jinja2 3.1.6
    llmcompressor 0.6.1.dev42+g92cdf630
    loguru 0.7.3
    MarkupSafe 3.0.2
    mpmath 1.3.0
    multidict 6.6.3
    multiprocess 0.70.16
    networkx 3.5
    numpy 2.2.6
    nvidia-cublas-cu12 12.6.4.1
    nvidia-cuda-cupti-cu12 12.6.80
    nvidia-cuda-nvrtc-cu12 12.6.77
    nvidia-cuda-runtime-cu12 12.6.77
    nvidia-cudnn-cu12 9.5.1.17
    nvidia-cufft-cu12 11.3.0.4
    nvidia-cufile-cu12 1.11.1.6
    nvidia-curand-cu12 10.3.7.77
    nvidia-cusolver-cu12 11.7.1.2
    nvidia-cusparse-cu12 12.5.4.2
    nvidia-cusparselt-cu12 0.6.3
    nvidia-ml-py 12.575.51
    nvidia-nccl-cu12 2.26.2
    nvidia-nvjitlink-cu12 12.6.85
    nvidia-nvtx-cu12 12.6.77
    nvitop 1.5.2
    packaging 25.0
    pandas 2.3.1
    pillow 11.3.0
    pip 24.3.1
    propcache 0.3.2
    psutil 7.0.0
    pyarrow 21.0.0
    pydantic 2.11.7
    pydantic_core 2.33.2
    pynvml 12.0.0
    python-dateutil 2.9.0.post0
    pytz 2025.2
    PyYAML 6.0.2
    regex 2024.11.6
    requests 2.32.4
    safetensors 0.5.3
    setuptools 80.9.0
    six 1.17.0
    sympy 1.14.0
    tokenizers 0.21.2
    torch 2.7.1
    tqdm 4.67.1
    transformers 4.54.0
    triton 3.3.1
    typing_extensions 4.14.1
    typing-inspection 0.4.1
    tzdata 2025.2
    urllib3 2.5.0
    xxhash 3.5.0
    yarl 1.20.1
  6. Other relevant environment information [e.g. hardware, CUDA version]:
    NVIDIA-SMI 535.183.06
    Driver Version: 535.183.06
    CUDA Version: 12.4

To Reproduce
Exact steps to reproduce the behavior: run AWQ quantization of DeepSeek-V3 (following PR #1619).

Errors
The process is OOM-killed by the host, with no additional error messages.
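Since the kill leaves no traceback, one way to narrow down where host memory is exhausted is to log the process's resident set size (RSS) from a background thread and correlate the last printed value with the per-layer progress logs. This is a hedged sketch, not part of llm-compressor; it only assumes psutil, which is already in the environment list above, and the logging interval is an arbitrary choice.

```python
# Sketch: periodically log this process's host RSS so the last line
# printed before an OOM kill shows roughly how far quantization got
# and how much memory it was using. psutil is listed in the environment
# above; the 5-second interval is illustrative.
import os
import threading

import psutil


def start_rss_logger(interval_s: float = 5.0) -> threading.Event:
    """Start a daemon thread that prints this process's RSS in GiB.

    Returns an Event; call .set() on it to stop logging.
    """
    proc = psutil.Process(os.getpid())
    stop = threading.Event()

    def _log() -> None:
        while not stop.is_set():
            rss_gib = proc.memory_info().rss / 2**30
            print(f"[rss-logger] host RSS: {rss_gib:.2f} GiB", flush=True)
            stop.wait(interval_s)

    threading.Thread(target=_log, daemon=True).start()
    return stop
```

Usage: call `start_rss_logger()` once before invoking the quantization run, and read the interleaved RSS lines alongside llm-compressor's layer-by-layer output to see whether memory grows steadily per layer or spikes at a particular stage.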


Labels: bug (Something isn't working)
