[AWQ] Support accumulation for reduced memory usage (#1435)
> [!NOTE]
> @brian-dellabetta updated the summary here (took over this PR from
> @kylesayrs)
### Summary
This update removes the `Catcher` logic ported from AutoAWQ and instead
uses the SequentialPipeline features plus a couple of hooks to cache the
args passed into module forward passes, as needed to run AWQ. Results
should not change significantly, but the implementation should be more
accurate because kwargs are now cached for each parent layer rather than
re-using those of the first module.
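
For illustration, here is a minimal sketch of the hook-based caching idea. The names and structure are hypothetical and not the actual llm-compressor implementation; the point is that a forward pre-hook records the kwargs seen by each parent layer so AWQ can later replay that layer with its own inputs.

```python
import torch

# Hypothetical sketch of per-parent-layer kwarg caching; names are
# illustrative, not the actual AWQModifier internals.
cached_kwargs: dict[torch.nn.Module, list[dict]] = {}

def make_cache_hook(parent: torch.nn.Module):
    def cache_kwargs_hook(module, args, kwargs):
        # Record a shallow copy of the kwargs for this specific parent
        # layer, rather than re-using the first module's kwargs.
        cached_kwargs.setdefault(parent, []).append(dict(kwargs))
        return None  # leave the forward call unmodified
    return cache_kwargs_hook

# Registered with with_kwargs=True so the hook also receives kwargs:
# handle = parent_layer.register_forward_pre_hook(
#     make_cache_hook(parent_layer), with_kwargs=True
# )
```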
Leveraging `IntermediatesCache` for the cached values, this also exposes
a new `offload_cache` bool on `AWQModifier`; if set to `True`, cached
values are offloaded at the expense of slower runtime. With
`meta-llama/Llama-2-7b-hf`, offloading decreases max GPU memory from
~27GB to ~20GB, at the cost of `apply_smoothing` taking ~17 seconds per
iteration instead of ~5 seconds. Because of this, I am leaving the
default set to not offload, and noting in the docstring that users can
toggle it if they encounter OOM errors.
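
A minimal usage sketch is below. Only `offload_cache` is the new option from this PR; the other arguments and import paths follow typical llm-compressor oneshot examples and may differ in a given release.

```python
# Minimal usage sketch, assuming llm-compressor's oneshot API; import
# paths and arguments other than `offload_cache` are illustrative.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        scheme="W4A16",        # example quantization scheme
        targets=["Linear"],    # example target layer type
        ignore=["lm_head"],
        # New in this PR: offload cached intermediate values to reduce
        # peak GPU memory (~27GB -> ~20GB on Llama-2-7b-hf), at the cost
        # of slower apply_smoothing (~17s vs ~5s per iteration).
        offload_cache=True,
    ),
]

oneshot(
    model="meta-llama/Llama-2-7b-hf",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```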
### Test Plan
Confirmed that these changes do not significantly alter perplexity (PPL)
scores relative to the current implementation on main:
- `meta-llama/Llama-3.2-3B-Instruct`
- PPL 14.1523 on main, 14.081 on this branch
- `Qwen/Qwen2.5-7B-Instruct`
- PPL 10.411 on main, 10.736 on this branch
- `meta-llama/Llama-2-7b-hf`
- PPL 9.5075 on main, 9.503 on this branch
---------
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>