SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context #1114
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
rahul-tuli
left a comment
Did we run smoothquant with this change? I have certainly come across cases where we run into OOM without this line (even though I know this shouldn't alleviate the issue). I also saw that error go away when the CUDA_LAUNCH_BLOCKING env variable was set. I'm good with this change as long as you've verified a smoothquant run! Thanks for investigating.
@rahul-tuli That's a good enough reason to wait until some regression tests are finished. We should figure out why the OOM occurs and potentially add that to the device map / fix memory leaks
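As a side note on why removing the call should be safe: once the last Python reference to a tensor is dropped, refcounting (or a garbage-collector pass) reclaims it and the caching allocator can reuse the memory, with no explicit cache flush needed. A minimal CPU-only sketch of that reference-lifetime behavior (no torch involved; `Tensorish` is just a stand-in class):

```python
import gc
import weakref

class Tensorish:
    """Stand-in for a tensor-like object, tracked via a weak reference."""

def make_and_drop():
    t = Tensorish()
    ref = weakref.ref(t)
    del t         # drop the last strong reference
    gc.collect()  # collector pass; CPython refcounting alone already frees it
    return ref

ref = make_and_drop()
print(ref() is None)  # → True: reclaimed without any explicit cache flush
```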
dsikka
left a comment
This method is used by more than just smoothquant, so we would need to do regression testing for modifiers outside of just smoothquant before making this change, if we're seeing the same effects as what Rahul has described. I would update the PR title/description to reflect this.
smoothquant has its own empty cache call, which is not targeted by this PR
Yep, the title is a bit of a misnomer; it's more in reference to "fixing smoothquant" rather than the smoothquant-specific implementation
Here are the results of memory profiling SmoothQuant under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch_size (16x2048).

[graph] Standard w/ calib_context: 10047127552

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
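For reference, the peak-memory measurement pattern behind numbers like the one above can be sketched with stdlib `tracemalloc` standing in for `torch.cuda.reset_peak_memory_stats()` / `torch.cuda.max_memory_allocated()`, so it runs without a GPU (`profile_peak` is an illustrative helper name, not from the PR):

```python
import tracemalloc

def profile_peak(fn) -> int:
    """Run fn and return the peak bytes allocated while it ran,
    analogous to resetting and then reading torch.cuda peak-memory stats."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

# toy "forward pass" that allocates (and then frees) a large buffer
peak = profile_peak(lambda: bytearray(10_000_000))
print(peak >= 10_000_000)  # → True
```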
Here are the results of memory profiling Quantization Modifier under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch_size (16x2048).

[graph] Standard w/ calib_context: 10047127552

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
From this brief analysis, I believe it's safe to conclude that
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
torch.cuda.empty_cache, use calibration_forward_context
dsikka
left a comment
Can you resolve conflicts?
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Not sure why ruff is acting up on that test case, but this PR now fixes it.
tests/llmcompressor/transformers/sparsification/test_compress_tensor_utils.py
## Purpose ##

* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled when calibrating

## Changes ##

* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue to disable quantization during calibration for GPTQ and SGPT
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
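The "disabling quantization during calibration" step described above is, at heart, a context manager that flips a flag off on entry and restores it on exit. A hedged, dependency-free sketch of that pattern (the names `disable_quantization` and `quantization_enabled` are stand-ins, not llm-compressor's actual API; plain dicts stand in for modules):

```python
from contextlib import contextmanager

@contextmanager
def disable_quantization(modules):
    """Temporarily clear each module's quantization flag, restoring the
    previous values on exit (even if the calibration body raises)."""
    prior = {name: m["quantization_enabled"] for name, m in modules.items()}
    for m in modules.values():
        m["quantization_enabled"] = False
    try:
        yield
    finally:
        for name, m in modules.items():
            m["quantization_enabled"] = prior[name]

modules = {"layer0": {"quantization_enabled": True}}
with disable_quantization(modules):
    during = modules["layer0"]["quantization_enabled"]    # False inside
print(during, modules["layer0"]["quantization_enabled"])  # → False True
```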
## Purpose
The `torch.cuda.empty_cache()` kernel sometimes fails to launch. Given that `empty_cache` does not actually free memory that wouldn't have already been freed by the Python garbage collector + PyTorch caching allocator, it should be safe to remove this call.

## Changes
* Remove `torch.cuda.empty_cache()` in `run_calibration_forward`, which only affects smoothquant and quantization modifier (sparsegpt and wanda will soon use sequential pipelines instead)
* Use `calibration_forward_context` in smoothquant and quantization modifier
* Remove the `torch.cuda.empty_cache()` call made by the smoothquant modifier

## Testing
* Tested removal of `torch.cuda.empty_cache` and `calibration_forward_context` independently

### Smooth Quant
### Quantization Modifier
It was also found that removing the `empty_cache` calls in between each operation reduced the runtime of Quantization Modifier on llama3-8B by 78%.

### Before
### After
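The 78% runtime claim above comes down to paying a fixed synchronization-like cost once per operation. A toy timing sketch of that effect, with `time.sleep` standing in for the cost of an `empty_cache` call (the helper `run_ops` and its overhead value are illustrative, not measurements from the PR):

```python
import time

def run_ops(n_ops: int, per_op_overhead: float = 0.0) -> float:
    """Run n_ops toy operations, optionally paying a fixed per-op
    overhead that models an empty_cache-style call between ops."""
    start = time.perf_counter()
    for _ in range(n_ops):
        sum(range(1_000))  # stand-in compute
        if per_op_overhead:
            time.sleep(per_op_overhead)  # stand-in for the cache flush
    return time.perf_counter() - start

with_flush = run_ops(50, per_op_overhead=0.002)  # ~100 ms of pure overhead
without_flush = run_ops(50)
print(without_flush < with_flush)  # → True
```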