Description
Normally, AWQ works by picking a layer that is going to be quantized, trying a range of scale factors to find the one that minimizes quantization error when that layer is quantized, and then applying the inverse rescale to the preceding layer, which is normally not quantized. However, a problem arises for the up_proj -> down_proj mapping,
because both the smooth and balance layers are targeted for quantization. Since our current AWQ implementation only accounts for the quantization of the balance layers, our choice of scale factor for the balance layer may make the smooth layer harder to quantize: the smooth layer is effectively ignored during the quantization error calculation for the balance layer.
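To make the issue concrete, here is a minimal, self-contained sketch of an AWQ-style grid search over scale factors. All names here (`awq_scale_search`, `pseudo_quantize`, the grid construction) are hypothetical stand-ins, not the actual implementation; the point is only to show where an `include_smooth` term would enter the error calculation:

```python
import torch


def pseudo_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Round-trip a weight through symmetric per-row quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def awq_scale_search(smooth_w, balance_w, x, n_grid=20, include_smooth=False):
    """Grid-search per-channel scales s. The balance weight's input columns
    are multiplied by s and the smooth weight's output rows are divided by
    s, so the composed function is unchanged. Error is measured on the
    balance layer's quantized output; with include_smooth=True, the smooth
    layer's own quantization error is added as well (the proposed change).
    """
    x_mean = x.abs().mean(dim=0)          # per-channel activation magnitude
    ref_out = x @ balance_w.t()           # unquantized reference output
    best_err, best_scales = float("inf"), None
    for i in range(n_grid):
        ratio = i / n_grid
        s = x_mean.pow(ratio).clamp(min=1e-4)
        s = s / (s.max() * s.min()).sqrt()
        # quantize the scaled balance weight, then undo the scale
        q_bal = pseudo_quantize(balance_w * s) / s
        err = (x @ q_bal.t() - ref_out).pow(2).mean()
        if include_smooth:
            # smooth layer rows carry the inverse scale; it is quantized too
            q_sm = pseudo_quantize(smooth_w / s.unsqueeze(1)) * s.unsqueeze(1)
            err = err + (q_sm - smooth_w).pow(2).mean()
        if err < best_err:
            best_err, best_scales = err.item(), s
        best_scales = best_scales if best_scales is not None else s
    return best_scales, best_err
```

Because the combined objective adds a non-negative smooth-layer term, a scale that is optimal for the balance layer alone can be suboptimal once the smooth layer's quantization is counted, which is exactly the effect worth measuring.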
We should
- test whether this has a significant impact
- add an option to enable this feature if it is beneficial
STEPS
A) add a check here for whether smooth_name is in targeted_names and, if so, change the get_lowest_common...etc search to include the smooth layer (this is how we determine which module is run to compute the quantization error, so the smooth layer needs to be run if we are taking its quantization into account)
B) add a flag to compute_best_scale indicating whether the smooth layer is targeted
C) if necessary add the smooth layer to this dict
D) move the rescale-weight code into a function that is called for each balance layer
E) if necessary, call the rescale-weight code with the inverse scales (the 1/_scales view) on the smooth_layer
F) check whether this has an impact on lm_eval performance for a small set of models
G) check how this affects the runtime of AWQ for those models
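Steps D and E above amount to factoring the scale application into one helper used in both directions. A rough sketch (the name `rescale_weight` and its signature are hypothetical, not the existing code):

```python
import torch


def rescale_weight(weight: torch.Tensor, scales: torch.Tensor,
                   *, inverse: bool = False) -> None:
    """Apply AWQ scales to a linear weight in place.

    Balance layers multiply their input columns by the scales; with
    inverse=True the smooth layer divides its output rows by the scales,
    so the composition smooth -> balance is numerically unchanged.
    """
    if inverse:
        weight.div_(scales.view(-1, 1))   # smooth layer: rows / scales
    else:
        weight.mul_(scales.view(1, -1))   # balance layer: cols * scales
```

Called once per balance layer and, when the smooth layer is targeted, once with `inverse=True` on the smooth layer, this keeps the fused mapping invariant while letting both sides be quantized.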
If it is beneficial, put up a PR with those changes, demonstrating what was tested and how it affects results.