This PR adds `smse` option to GPTQ to improve accuracy. TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@mhs4670go

    type=str,
    default=None,
    help="Whether and how to use mse in gptq (none/mse/smse/)",
How about using choices instead?
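The suggested change could look like the following minimal sketch (the `--gptq_mse` flag name and values come from the diff above; the rest of the parser setup is illustrative). With `choices`, invalid values fail at parse time instead of deep inside GPTQ:

```python
import argparse

# Hypothetical sketch of the suggestion: constrain --gptq_mse to a fixed
# set of values so argparse rejects anything else up front.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--gptq_mse",
    type=str,
    choices=["mse", "smse"],
    default=None,
    help="Whether and how to use mse in gptq",
)

args = parser.parse_args(["--gptq_mse", "smse"])
print(args.gptq_mse)  # smse
```

Passing an unlisted value (e.g. `--gptq_mse bogus`) then exits with a usage error automatically.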
    sens = None
    if args.gptq_mse is not None and (
        args.gptq_mse == "smse" or args.gptq_mse == "smse_for_gptq"
`smse_for_gptq` seems like a duplicate option name. Is it necessary to have?
Sorry. I'll remove it.
IMHO, there's no explanation of the smse feature. What is smse, why does sensitivity come in, etc.? Could you add some documentation for this, in README.md or some other place you think is good?
ok.
    outputs = logits.squeeze()
    targets = targets.squeeze()

    b_indices = [outputs.shape[0] - 1]  # priority to the last token
Just out of curiosity: b_indices is always a list of size one. Is the for loop below necessary?
Currently - no. I'll remove it. Thank you!
    return dataloader


    class SensitivityCalibrator:
Empirical Fisher Information
For the reviewers, Empirical Fisher Information is a practical way to estimate how important each model parameter is.
Intuitively, the idea is simple:
If changing a weight causes the model's output to change a lot, that weight is important.
If changing a weight barely affects the output, that weight is less important.
A common way to estimate this is to look at the squared gradients of the loss. Large gradients mean the model is sensitive to that weight, so it should be treated more carefully (e.g., during quantization).
However, Fisher Information is defined with respect to samples drawn from the model's own probability distribution. In practice, instead of sampling from the full distribution (which is expensive), many implementations simply use the model's own prediction as a pseudo label:
target ≈ argmax(logits)
So the procedure becomes:
- Run the model to obtain logits.
- Use the predicted token (argmax) as a pseudo target.
- Compute the loss and gradients.
- Accumulate squared gradients to estimate parameter sensitivity.
The resulting values serve as a weight importance / sensitivity estimate, which can then be used to guide quantization.
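The steps above can be sketched as a minimal toy example (this is illustrative, not the PR's `SensitivityCalibrator`; the model and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

# Empirical-Fisher-style sensitivity estimate: accumulate squared gradients
# of a pseudo-label loss over a small calibration set.
torch.manual_seed(0)
model = torch.nn.Linear(8, 4)
sensitivity = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

calib_inputs = [torch.randn(3, 8) for _ in range(4)]
for x in calib_inputs:
    logits = model(x)
    # use the model's own prediction as a pseudo target: target ~ argmax(logits)
    pseudo_target = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, pseudo_target)
    model.zero_grad()
    loss.backward()
    for n, p in model.named_parameters():
        sensitivity[n] += p.grad.detach() ** 2  # accumulate grad^2

# larger accumulated values = weights the output is more sensitive to
print(sensitivity["weight"].shape)  # torch.Size([4, 8])
```

The resulting per-weight values can then be normalized or averaged over samples before being used to guide quantization.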
    # update second order information as current weights gradients are ready
    for name in modules_to_process:
        cur_module = modules_to_process[name]
        cur_grad = copy.deepcopy(cur_module.weight.grad.detach())  # type: ignore[union-attr]
Is this deepcopy necessary? Or, how about using clone() instead?
There was a problem hiding this comment.
Previously I had some issues with clone(). But currently clone() seems to work just fine. Thank you!
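For context, the difference under discussion can be shown in a small sketch (assuming the gradient is a plain detached tensor, as in the diff above): `clone()` already yields an independent copy of the data, so `deepcopy` is only needed for arbitrary Python object graphs.

```python
import copy
import torch

torch.manual_seed(0)
w = torch.randn(3, 3, requires_grad=True)
(w ** 2).sum().backward()  # grad = 2 * w

g_clone = w.grad.detach().clone()
g_deep = copy.deepcopy(w.grad.detach())

# both copies are value-equal and do not alias the gradient's storage
assert torch.equal(g_clone, g_deep)
w.grad.zero_()
assert not torch.equal(g_clone, w.grad)  # the clone kept the old values
```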
I'm not sure, but the current structure looks like this.

    # pass 1: sensitivity
    for inp in calib_inputs:
        forward
        backward
        accumulate grad^2

    # pass 2: GPTQ
    for inp in calib_inputs:
        forward
        accumulate GPTQ stats

You don't have to revisit right now, but it would be better to do the same thing with only one pass later.
Additionally, classes or APIs for smse could be moved to gptq/utils.py instead.
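A hypothetical sketch of the single-pass idea: one loop over the calibration data does forward + backward, accumulating both grad^2 (sensitivity) and GPTQ-style input statistics via a forward hook. All names here (`layer`, `sens`, `hessian`) are illustrative, not the PR's API:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Linear(8, 4)
sens = torch.zeros_like(layer.weight)
hessian = torch.zeros(8, 8)  # GPTQ-style second-order input statistic X^T X

# capture the layer input during the forward pass
inputs = []
hook = layer.register_forward_hook(lambda m, i, o: inputs.append(i[0].detach()))

for x in [torch.randn(3, 8) for _ in range(4)]:
    logits = layer(x)
    loss = F.cross_entropy(logits, logits.argmax(dim=-1))  # pseudo-label loss
    layer.zero_grad()
    loss.backward()
    sens += layer.weight.grad.detach() ** 2  # sensitivity accumulation
    xi = inputs.pop()
    hessian += xi.T @ xi                     # GPTQ statistics accumulation

hook.remove()
```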
> I'm not sure, but the current structure looks like this.
@mhs4670go
Yes. Right now they are similar (just iterating through the dataset); I tried to introduce minimal changes to GPTQ. But merging them into a single pass is a good idea. There is also the possibility of using external sensitivity, to try something different, maybe calibrating sensitivity on another dataset. That's why, IMHO, merging them can be done later. Or should I remove external sensitivity and merge the sensitivity pass with the inference pass?
> Additionally, classes or APIs for smse could be moved to gptq/utils.py instead.
I'll do it.
Thank you!
@mhs4670go
Moreover, sensitivities can be used elsewhere (e.g., in an MPQ solution).
> That's why, IMHO, merging them can be done later.
As you said, I think you can do it later. Please feel free to do the work later.
The `mse` parameter of `GPTQConfig` is supposed to tune the quantizer used in GPTQ. There are two options:
1. `mse` - vanilla `mse`. Produces quantization parameters for the GPTQ quantizer (`min`/`max`) which minimize the mean squared error of quantization: $MSE\_MIN\_MAX\_FOR\_W = \arg\min_{min, max} \|W - Q_{min, max}(W)\|^2$.
2. `smse` - sensitivity-based `mse`. Uses the sensitivity of some global feature (e.g. float-model logits) to parameter changes in order to minimize the global effect of quantization: $SMSE\_MIN\_MAX\_FOR\_W = \arg\min_{min, max} \|(W - Q_{min, max}(W))^2 \cdot Sensitivity(W)\|$. So we try to keep "important" parameters unchanged, while quantizing "unimportant" parameters more aggressively.
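An illustrative sketch of the smse objective above (not the PR's implementation; the grid-search strategy, helper names, and bit-width are assumptions): scan candidate clipping ranges and keep the one minimizing the sensitivity-weighted quantization error.

```python
import torch

def quantize(w, lo, hi, n_bits=4):
    # uniform affine quantization of w into the [lo, hi] range
    scale = (hi - lo) / (2 ** n_bits - 1)
    q = ((w.clamp(lo, hi) - lo) / scale).round()
    return q * scale + lo

def smse_min_max(w, sensitivity, n_grid=20):
    # search over shrunken versions of the full range, minimizing
    # sum((W - Q(W))^2 * Sensitivity(W))
    best, best_err = (w.min(), w.max()), float("inf")
    for i in range(1, n_grid + 1):
        shrink = i / n_grid
        lo, hi = w.min() * shrink, w.max() * shrink
        err = (((w - quantize(w, lo, hi)) ** 2) * sensitivity).sum()
        if err < best_err:
            best, best_err = (lo, hi), err
    return best

torch.manual_seed(0)
w = torch.randn(16)
sens = torch.rand(16)   # per-weight sensitivity (e.g. accumulated grad^2)
lo, hi = smse_min_max(w, sens)
```

With `sensitivity` set to all ones this degenerates to the vanilla `mse` search.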
The md-formatted words don't show properly.
Ahhh. That's sad. Thank you!
This PR adds `smse` option to GPTQ to improve accuracy.

TLLama1B
benchmarks
Original:
mse:
smse:
logs
mse
smse
LLama3.2-1B
benchmarks
Original:
mse:
smse:
logs
mse
smse
LLama3.2-3B
benchmarks
Original:
mse:
smse:
logs
mse
smse
Note for reviewers:
Although `smse` provides the best PPL on wikipedia, it does not provide the best performance on benchmarks. (Seems like `smse` can overfit on wikipedia in exchange for some other tasks.)
For an increasing number of samples we get: although `ppl` increased slightly, we got no performance drop on any of the benchmarks.
logs
smse_256
smse_512
So, in my personal and humble opinion, given a refined, balanced dataset (not wikitext), `smse` would be able to considerably improve model performance.