
Commit 9439f18

[AWQ] Support accumulation for reduced memory usage (#1435)
> [!NOTE]
> @brian-dellabetta updated summary here (took over this PR from @kylesayrs)

### Summary

This update removes the `Catcher` logic ported from AutoAWQ, instead using the SequentialPipeline features and a couple of hooks to cache args into module forward passes, as needed to run AWQ. It should cause no significant change in results, but it should be a more accurate implementation because kwargs are cached for each parent layer rather than re-using those of the first module.

Leveraging `IntermediatesCache` for cached values, this also exposes a new `offload_cache` bool on `AWQModifier`; if set to True, cached values are offloaded at the expense of slower runtime. With `meta-llama/Llama-2-7b-hf`, offloading decreases max GPU memory from ~27GB to ~20GB, at the cost of `apply_smoothing` taking ~17 seconds per iteration as opposed to ~5 seconds. Because of this, I am leaving the default to not offload, just noting in the docstring to toggle this if users encounter OOM errors.

### Test Plan

Confirmed that these changes don't significantly alter PPL scores relative to the current implementation on main.

- `meta-llama/Llama-3.2-3B-Instruct` - PPL 14.1523 on main, 14.081 on this branch
- `Qwen/Qwen2.5-7B-Instruct` - PPL 10.411 on main, 10.736 on this branch
- `meta-llama/Llama-2-7b-hf` - PPL 9.5075 on main, 9.503 on this branch

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
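For context, a minimal usage sketch assuming the `llmcompressor` `oneshot` entrypoint; the model name, calibration dataset, and the other `AWQModifier` arguments shown are illustrative, and only `offload_cache` is the option introduced by this change:

```python
# Hedged sketch: calibrate a model with AWQModifier and the new offload_cache flag.
# Everything except offload_cache (model, dataset, scheme, sample counts) is an
# illustrative assumption, not part of this commit.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],      # skip the output head during quantization
        scheme="W4A16_ASYM",     # example 4-bit weight quantization scheme
        targets=["Linear"],      # apply to Linear layers
        offload_cache=True,      # offload cached values: lower GPU memory, slower smoothing
    ),
]

oneshot(
    model="meta-llama/Llama-2-7b-hf",
    dataset="open_platypus",     # example calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```

With `offload_cache=True`, the trade-off described above applies: max GPU memory drops at the cost of slower `apply_smoothing` iterations, so the flag is best reserved for runs that would otherwise hit OOM errors.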
1 parent 91b15d2 commit 9439f18

File tree

3 files changed: +147 −223 lines


0 commit comments
