SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context #1114
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
rahul-tuli
left a comment
Did we run smoothquant with this change? I have certainly come across cases where we run into OOM without this line (even though I know this shouldn't alleviate the issue). I also saw that error go away when the CUDA_LAUNCH_BLOCKING env variable was set. I'm good with this change as long as you've verified a smoothquant run! Thanks for investigating.
@rahul-tuli That's a good enough reason to wait until some regression tests are finished. We should figure out why the OOM occurs and potentially add that to the device map / fix memory leaks
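As a side note on why removing the call should be safe: once the last Python reference to a tensor is dropped, refcounting (or a garbage-collector pass) reclaims it and the caching allocator can reuse the memory, with no explicit cache flush needed. A minimal CPU-only sketch of that reference-lifetime behavior (no torch involved; `Tensorish` is just a stand-in class):

```python
import gc
import weakref

class Tensorish:
    """Stand-in for a tensor-like object, tracked via a weak reference."""

def make_and_drop():
    t = Tensorish()
    ref = weakref.ref(t)
    del t         # drop the last strong reference
    gc.collect()  # collector pass; CPython refcounting alone already frees it
    return ref

ref = make_and_drop()
print(ref() is None)  # → True: reclaimed without any explicit cache flush
```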
dsikka
left a comment
This method is used by more than just smoothquant, so we would need to do regression testing for modifiers outside of just smoothquant before making this change, if we're seeing the same effects as what Rahul has described. I would update the PR title/description to reflect this.
smoothquant has its own empty cache call, which is not targeted by this PR
Yep, the title is a bit of a misnomer; it's more in reference to "fixing smoothquant" rather than the smoothquant-specific implementation
Here are the results of memory profiling SmoothQuant under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch_size (16x2048).

[graph] Standard w/ calib_context: 10047127552

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
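For reference, the peak-memory measurement pattern behind numbers like the one above can be sketched with stdlib `tracemalloc` standing in for `torch.cuda.reset_peak_memory_stats()` / `torch.cuda.max_memory_allocated()`, so it runs without a GPU (`profile_peak` is an illustrative helper name, not from the PR):

```python
import tracemalloc

def profile_peak(fn) -> int:
    """Run fn and return the peak bytes allocated while it ran,
    analogous to resetting and then reading torch.cuda peak-memory stats."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

# toy "forward pass" that allocates (and then frees) a large buffer
peak = profile_peak(lambda: bytearray(10_000_000))
print(peak >= 10_000_000)  # → True
```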
Here are the results of memory profiling Quantization Modifier under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch_size (16x2048).

[graph] Standard w/ calib_context: 10047127552

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
From this brief analysis, I believe it's safe to conclude that
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
torch.cuda.empty_cache, use calibration_forward_context
dsikka
left a comment
Can you resolve conflicts?
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Not sure why ruff is acting up on that test case, but this PR now fixes it.
tests/llmcompressor/transformers/sparsification/test_compress_tensor_utils.py
## Purpose ##

* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled when calibrating

## Changes ##

* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue to disable quantization during calibration for GPTQ and SGPT
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
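The "disabling quantization during calibration" step described above is, at heart, a context manager that flips a flag off on entry and restores it on exit. A hedged, dependency-free sketch of that pattern (the names `disable_quantization` and `quantization_enabled` are stand-ins, not llm-compressor's actual API; plain dicts stand in for modules):

```python
from contextlib import contextmanager

@contextmanager
def disable_quantization(modules):
    """Temporarily clear each module's quantization flag, restoring the
    previous values on exit (even if the calibration body raises)."""
    prior = {name: m["quantization_enabled"] for name, m in modules.items()}
    for m in modules.values():
        m["quantization_enabled"] = False
    try:
        yield
    finally:
        for name, m in modules.items():
            m["quantization_enabled"] = prior[name]

modules = {"layer0": {"quantization_enabled": True}}
with disable_quantization(modules):
    during = modules["layer0"]["quantization_enabled"]    # False inside
print(during, modules["layer0"]["quantization_enabled"])  # → False True
```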
## Purpose
The `torch.cuda.empty_cache()` kernel sometimes fails to launch. Given that `empty_cache` does not actually free memory that wouldn't have already been freed by the Python garbage collector + PyTorch caching allocator, it should be safe to remove this call.

## Changes
* Remove `torch.cuda.empty_cache()` in `run_calibration_forward`, which only affects smoothquant and quantization modifier (sparsegpt and wanda will soon use sequential pipelines instead)
* Use `calibration_forward_context` in smoothquant and quantization modifier
* Remove the `torch.cuda.empty_cache()` call made by the smoothquant modifier

## Testing
* Tested removal of `torch.cuda.empty_cache` and `calibration_forward_context` independently

### Smooth Quant
### Quantization Modifier
It was also found that removing the `empty_cache` calls in between each operation reduced the runtime of Quantization Modifier on llama3-8B by 78%.

### Before
### After
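The 78% runtime claim above comes down to paying a fixed synchronization-like cost once per operation. A toy timing sketch of that effect, with `time.sleep` standing in for the cost of an `empty_cache` call (the helper `run_ops` and its overhead value are illustrative, not measurements from the PR):

```python
import time

def run_ops(n_ops: int, per_op_overhead: float = 0.0) -> float:
    """Run n_ops toy operations, optionally paying a fixed per-op
    overhead that models an empty_cache-style call between ops."""
    start = time.perf_counter()
    for _ in range(n_ops):
        sum(range(1_000))  # stand-in compute
        if per_op_overhead:
            time.sleep(per_op_overhead)  # stand-in for the cache flush
    return time.perf_counter() - start

with_flush = run_ops(50, per_op_overhead=0.002)  # ~100 ms of pure overhead
without_flush = run_ops(50)
print(without_flush < with_flush)  # → True
```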