Fix xnnpack quantization discrepancy for non-fp32 #8488
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8488
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 new failures as of commit be1921c with merge base 0dd7e4e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```
edge_manager = edge_manager.set_output_dir(output_dir_path).source_transform(
    _get_source_transforms(
        args.model, DType.from_torch_dtype(checkpoint_dtype), args
    )
```
Added
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```
here to test.
```
# We want to compute the actual ops in the precision of the dtype_override,
# since the precision of the quantized linear will initially be the dtype of the
# checkpoint, not the dtype_override.
def _set_precision_to_fp32(module):
```
cc @kimishpatel for the issue we were discussing
examples/models/llama/model.py (outdated)
```
# Convert the model's weights only to the checkpoint's dtype, so that
# the checkpoint can be loaded into the model's state dict in its
# own dtype w/o potential precision loss.
for param in self.model_.parameters():
    param.data = param.data.to(dtype=self.checkpoint_dtype)
```
We shouldn't have to do this if the checkpoint is loaded directly, no? Not sure what's happening with `self.model_.to(...)`.
`self.model_.to(...)` needs to happen before the params are set to the checkpoint dtype, so that we end up with our weights in the checkpoint dtype (needed for quantization) and the rest of the model in the dtype override. Then when we load the checkpoint, no dtype promotion will happen.
This dtype promotion is technically always lossless, since all of the dtypes we support convert losslessly to fp32, but I'm doing this in case we want to support dtypes in the future that don't have a lossless conversion to fp32. If we can make this assumption though, then we can decouple model.py from dtype casting and move the logic outside, which I think @larryliu0820 was looking to do.
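A minimal sketch of the ordering described above, for illustration only; the function name and signature are made up here, and the real logic lives inline in model.py:
```python
def load_checkpoint_without_promotion(model, checkpoint, checkpoint_dtype, dtype_override):
    # Hypothetical helper illustrating the ordering discussed above; not the actual model.py code.
    # 1) Cast the whole module to the compute dtype first, so buffers and any params
    #    that are not present in the checkpoint end up in the dtype override.
    model = model.to(dtype=dtype_override)
    # 2) Put only the parameters back into the checkpoint's dtype so quantization
    #    later sees the original, uncast weight values.
    for param in model.parameters():
        param.data = param.data.to(dtype=checkpoint_dtype)
    # 3) Loading now copies tensors whose dtypes already match, so no implicit
    #    (potentially lossy) promotion happens during load_state_dict.
    model.load_state_dict(checkpoint, strict=False)
    return model
```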
```
.source_transform(_get_source_transforms(args.model, dtype_override, args))
)

_set_quantized_computation_dtype(
```
Besides the changes you are doing in this function, don't you also need to do `edge_manager.model.to(self.dtype)`?
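For context, a rough sketch of what a pass like `_set_quantized_computation_dtype` could look like; the attribute names below are assumptions about torchao's 8da4w quantized linear, not the actual implementation in this PR:
```python
import torch

def set_quantized_computation_dtype_sketch(module: torch.nn.Module, dtype: torch.dtype):
    # Walk the quantized model and retarget the dequant/compute dtype to the
    # dtype override, leaving the packed int weights untouched.
    for child in module.modules():
        # Assumed attribute names; the real quantized linear may differ.
        if hasattr(child, "precision"):
            child.precision = dtype
        if hasattr(child, "scales_and_zeros"):
            child.scales_and_zeros = child.scales_and_zeros.to(dtype)
    return module
```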
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D70184325
Summary: Perform quantization on the weights expressed in their original dtype (from the checkpoint) by passing in the checkpoint dtype to the quantization source transformation and modifying the computation dtype (the result dtype of the dequant, the dtype that the ops are actually computed in) to the dtype override. We must do it this way since the checkpoint and computation dtype are coupled into a single `precision` parameter in the torchao API, and that is something that we cannot change.

Note - no need to worry about https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py#L1168, precision is passed in with the checkpoint dtype.

### Comparison of arbitrary q_proj tensor from sample Llama checkpoint:

Before:
```
Mismatched elements: 3260378 / 4194304 (77.7%)
Greatest absolute difference: 0.08802086114883423 at index (1129, 604) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 1350) (up to 1.3e-06 allowed)
Signal-to-noise: 32.8974 dB
```

After: no difference

Test Plan:

### Manual testing
```
python -m examples.models.llama.export_llama \
    -v -c xl_consolidated/consolidated_renamed.pth \
    -p xl_consolidated/et_params.json -kv -d fp32 \
    -qmode 8da4w --group_size 32 -X \
    --use_sdpa_with_kv_cache \
    --output_name quantized_baseline.pte \
    --max_context_length 4096 -E 4,32
```

With the following inserted after the quantization:
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```

And the following modifications to GPTQ.py in torchao: pytorch/ao#1756 for testing.

### Automated testing + existing CI tests

### Regression testing
TBD

Reviewed By: kimishpatel

Differential Revision: D70184325

Pulled By: jackzhxng
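Roughly, the flow that summary describes looks like the sketch below. It assumes torchao's `Int8DynActInt4WeightQuantizer` API used on the 8da4w path and a hypothetical computation-dtype pass like the one sketched earlier; the real source transform in export_llama may differ:
```python
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

def quantize_in_checkpoint_dtype(model, checkpoint_dtype, dtype_override, group_size=32):
    # Quantize while the weights are still in the checkpoint dtype, so scales and
    # zeros are derived from the original values (no pre-quantization cast).
    model = Int8DynActInt4WeightQuantizer(
        groupsize=group_size, precision=checkpoint_dtype
    ).quantize(model)
    # Then move only the computation (dequant output) dtype to the override,
    # e.g. with a pass like the _set_quantized_computation_dtype sketch above.
    for child in model.modules():
        if hasattr(child, "precision"):  # assumed attribute name
            child.precision = dtype_override
    return model
```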
Summary
Perform quantization on the weights expressed in their original dtype (from the checkpoint) by performing source transformations before the dtype cast. Previously the model was being converted to the `dtype_override` arg's dtype and then quantized. This supposedly eliminates quantization noise.

Note - no need to worry about https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py#L1168, precision is passed in with the checkpoint dtype.

Comparison of arbitrary q_proj tensor from sample Llama checkpoint:
Before:
```
Mismatched elements: 3260378 / 4194304 (77.7%)
Greatest absolute difference: 0.08802086114883423 at index (1129, 604) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 1350) (up to 1.3e-06 allowed)
Signal-to-noise: 32.8974 dB
```
After: no difference

Test plan
Manual testing
```
python -m examples.models.llama.export_llama \
    -v -c xl_consolidated/consolidated_renamed.pth \
    -p xl_consolidated/et_params.json -kv -d fp32 \
    -qmode 8da4w --group_size 32 -X \
    --use_sdpa_with_kv_cache \
    --output_name quantized_baseline.pte \
    --max_context_length 4096 -E 4,32
```
With the following inserted after the quantization:
```
edge_manager.model(
    torch.tensor([[2, 3, 4]], dtype=torch.long),
    {"input_pos": torch.tensor([0], dtype=torch.long)},
)
```
And the following modifications to GPTQ.py in torchao: pytorch/ao#1756 for testing.
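For reference, the before/after numbers above can be reproduced with a comparison along these lines; this is a sketch, not the exact script used for the PR:
```python
import torch

def compare_q_proj(reference: torch.Tensor, candidate: torch.Tensor) -> None:
    # SQNR in dB; the small epsilon avoids a divide-by-zero when the tensors match exactly.
    noise = (candidate.float() - reference.float()).pow(2).mean()
    signal = reference.float().pow(2).mean()
    sqnr_db = (10 * torch.log10(signal / (noise + 1e-20))).item()
    print(f"Signal-to-noise: {sqnr_db:.4f} dB")
    # assert_close raises with the "Mismatched elements / Greatest absolute difference /
    # Greatest relative difference" report shown above when the tensors differ.
    torch.testing.assert_close(candidate, reference)
```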