Fix xnnpack quantization discrepancy for non-fp32 #8488
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8488
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 new failures as of commit be1921c with merge base 0dd7e4e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```
edge_manager = edge_manager.set_output_dir(output_dir_path).source_transform(
    _get_source_transforms(
        args.model, DType.from_torch_dtype(checkpoint_dtype), args
    )
```
Added
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```
here to test.
```
# We want to compute the actual ops in the precision of the dtype_override,
# since the precision of the quantized linear will initially be the dtype of the
# checkpoint, not the dtype_override.
def _set_precision_to_fp32(module):
```
cc @kimishpatel for the issue we were discussing
examples/models/llama/model.py (outdated)
```
# Convert the model's weights only to the checkpoint's dtype, so that
# the checkpoint can be loaded into the model's state dict in its
# own dtype w/o potential precision loss.
for param in self.model_.parameters():
    param.data = param.data.to(dtype=self.checkpoint_dtype)
```
We shouldn't have to do this if the checkpoint is loaded directly, no? Not sure what's happening with `self.model_.to(...)`.
`self.model_.to(...)` needs to happen before the params are set to the checkpoint dtype, so that we end up with our weights in the checkpoint dtype (needed for quantization) and the rest of the model in the dtype override. Then when we load the checkpoint, no dtype promotion will happen.
This dtype promotion is technically always lossless, since all of the dtypes we support convert losslessly to fp32, but I'm doing this in case we want to support dtypes in the future that don't have a lossless conversion to fp32. If we can make this assumption though, then we can decouple model.py from dtype casting and move the logic outside, which I think @larryliu0820 was looking to do.
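A minimal sketch of the ordering described above, for illustration only; the function name and signature are made up here, and the real logic lives inline in model.py:
```python
def load_checkpoint_without_promotion(model, checkpoint, checkpoint_dtype, dtype_override):
    # Hypothetical helper illustrating the ordering discussed above; not the actual model.py code.
    # 1) Cast the whole module to the compute dtype first, so buffers and any params
    #    that are not present in the checkpoint end up in the dtype override.
    model = model.to(dtype=dtype_override)
    # 2) Put only the parameters back into the checkpoint's dtype so quantization
    #    later sees the original, uncast weight values.
    for param in model.parameters():
        param.data = param.data.to(dtype=checkpoint_dtype)
    # 3) Loading now copies tensors whose dtypes already match, so no implicit
    #    (potentially lossy) promotion happens during load_state_dict.
    model.load_state_dict(checkpoint, strict=False)
    return model
```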
```
.source_transform(_get_source_transforms(args.model, dtype_override, args))
)

_set_quantized_computation_dtype(
```
Besides the changes you are doing in this function, don't you also need to do `edge_manager.model.to(self.dtype)`?
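For context, a rough sketch of what a pass like `_set_quantized_computation_dtype` could look like; the attribute names below are assumptions about torchao's 8da4w quantized linear, not the actual implementation in this PR:
```python
import torch

def set_quantized_computation_dtype_sketch(module: torch.nn.Module, dtype: torch.dtype):
    # Walk the quantized model and retarget the dequant/compute dtype to the
    # dtype override, leaving the packed int weights untouched.
    for child in module.modules():
        # Assumed attribute names; the real quantized linear may differ.
        if hasattr(child, "precision"):
            child.precision = dtype
        if hasattr(child, "scales_and_zeros"):
            child.scales_and_zeros = child.scales_and_zeros.to(dtype)
    return module
```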
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D70184325
Summary: Perform quantization on the weights expressed in their original dtype (from the checkpoint) by passing in the checkpoint dtype to the quantization source transformation and modifying the computation dtype (the result dtype of the dequant, the dtype that the ops are actually computed in) to the dtype override. We must do it this way since the checkpoint and computation dtype are coupled into a single `precision` parameter in the torchao API, and that is something that we cannot change.

Note - no need to worry about https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py#L1168, precision is passed in with the checkpoint dtype.

### Comparison of arbitrary q_proj tensor from sample Llama checkpoint:

Before:
```
Mismatched elements: 3260378 / 4194304 (77.7%)
Greatest absolute difference: 0.08802086114883423 at index (1129, 604) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 1350) (up to 1.3e-06 allowed)
Signal-to-noise: 32.8974 dB
```

After: no difference

Test Plan:

### Manual testing
```
python -m examples.models.llama.export_llama \
    -v -c xl_consolidated/consolidated_renamed.pth \
    -p xl_consolidated/et_params.json -kv -d fp32 \
    -qmode 8da4w --group_size 32 -X \
    --use_sdpa_with_kv_cache \
    --output_name quantized_baseline.pte \
    --max_context_length 4096 -E 4,32
```

With the following inserted after the quantization:
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```

And the following modifications to GPTQ.py in torchao: pytorch/ao#1756 for testing.

### Automated testing + existing CI tests

### Regression testing
TBD

Reviewed By: kimishpatel

Differential Revision: D70184325

Pulled By: jackzhxng
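Roughly, the flow that summary describes looks like the sketch below. It assumes torchao's `Int8DynActInt4WeightQuantizer` API used on the 8da4w path and a hypothetical computation-dtype pass like the one sketched earlier; the real source transform in export_llama may differ:
```python
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

def quantize_in_checkpoint_dtype(model, checkpoint_dtype, dtype_override, group_size=32):
    # Quantize while the weights are still in the checkpoint dtype, so scales and
    # zeros are derived from the original values (no pre-quantization cast).
    model = Int8DynActInt4WeightQuantizer(
        groupsize=group_size, precision=checkpoint_dtype
    ).quantize(model)
    # Then move only the computation (dequant output) dtype to the override,
    # e.g. with a pass like the _set_quantized_computation_dtype sketch above.
    for child in model.modules():
        if hasattr(child, "precision"):  # assumed attribute name
            child.precision = dtype_override
    return model
```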
Summary
Perform quantization on the weights expressed in their original dtype (from the checkpoint) by performing source transformations before the dtype cast. Previously the model was being converted to the `dtype_override` arg's dtype and then quantized. This supposedly eliminates quantization noise.

Note - no need to worry about https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py#L1168, precision is passed in with the checkpoint dtype.

Comparison of arbitrary q_proj tensor from sample Llama checkpoint:
Before:
```
Mismatched elements: 3260378 / 4194304 (77.7%)
Greatest absolute difference: 0.08802086114883423 at index (1129, 604) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 1350) (up to 1.3e-06 allowed)
Signal-to-noise: 32.8974 dB
```
After: no difference

Test plan
Manual testing
```
python -m examples.models.llama.export_llama \
    -v -c xl_consolidated/consolidated_renamed.pth \
    -p xl_consolidated/et_params.json -kv -d fp32 \
    -qmode 8da4w --group_size 32 -X \
    --use_sdpa_with_kv_cache \
    --output_name quantized_baseline.pte \
    --max_context_length 4096 -E 4,32
```
With the following inserted after the quantization:
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```
And the following modifications to GPTQ.py in torchao: pytorch/ao#1756 for testing.
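For reference, the before/after numbers above can be reproduced with a comparison along these lines; this is a sketch, not the exact script used for the PR:
```python
import torch

def compare_q_proj(reference: torch.Tensor, candidate: torch.Tensor) -> None:
    # SQNR in dB; the small epsilon avoids a divide-by-zero when the tensors match exactly.
    noise = (candidate.float() - reference.float()).pow(2).mean()
    signal = reference.float().pow(2).mean()
    sqnr_db = (10 * torch.log10(signal / (noise + 1e-20))).item()
    print(f"Signal-to-noise: {sqnr_db:.4f} dB")
    # assert_close raises with the "Mismatched elements / Greatest absolute difference /
    # Greatest relative difference" report shown above when the tensors differ.
    torch.testing.assert_close(candidate, reference)
```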