
Conversation

@rohansjoshi (Contributor) commented Jul 10, 2025

Summary:
`llama.py` now has a `--range_setting` flag with two options: `mse_weight_only` and `mse_with_act_loss`. There is also an eval script for computing perplexity, `eval_llama_qnn.py` (for faster eval, try a sequence length of 1024). The eval script additionally has a `--quant_linear_only` flag that quantizes only linear/conv nodes, for running faster experiments.
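For context, MSE-based range setting picks quantization parameters by searching for the clipping range that minimizes reconstruction error, rather than using the tensor's raw min/max. Below is a minimal, generic sketch of the weight-only idea; this is not the PR's implementation, and the function name, grid size, and symmetric per-tensor scheme are illustrative assumptions. The `mse_with_act_loss` option presumably measures the error on layer outputs instead of on the weights directly.

```
import torch

def mse_weight_scale(w: torch.Tensor, n_bits: int = 4, n_steps: int = 100) -> torch.Tensor:
    """Hypothetical sketch: pick a symmetric per-tensor scale for `w` by minimizing quantization MSE."""
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 7 for 4-bit signed weights
    w_absmax = w.abs().max()
    if w_absmax == 0:
        return torch.tensor(1.0)           # degenerate all-zero tensor
    best_scale = w_absmax / qmax           # min/max baseline
    best_err = float("inf")
    for i in range(1, n_steps + 1):
        clip = w_absmax * i / n_steps      # candidate clipping threshold
        scale = clip / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)  # fake-quantize
        err = torch.mean((q * scale - w) ** 2).item()             # reconstruction MSE
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```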

Commands:

python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss

python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
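For quicker experiments, the knobs mentioned in the summary can be combined. Assuming the sequence-length suggestion maps to `--max_seq_length` and the flags compose independently (not verified here), a faster eval run might look like:

python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 1024 --ptq 16a4w --range_setting mse_with_act_loss --quant_linear_only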

Update: I've also added SpinQuant as a feature to further improve accuracy. Both `llama.py` and `eval_llama_qnn.py` now have a `--spinquant` flag, which can be used in combination with range setting or by itself. Based on my experiments on Llama 1B with `16a4w`, I get the best results using both `--spinquant` and `--range_setting mse_with_act_loss`.
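For reference, SpinQuant inserts rotation matrices that smooth out weight and activation outliers before quantization. Assuming `--spinquant` composes with the other flags as described above, the best-performing configuration from those experiments would be invoked as:

python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss --spinquant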

Rollback Plan:

Differential Revision: D78127727

rohansjoshi requested a review from cccclai as a code owner July 10, 2025 23:02
pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12377

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 72da5c1 with merge base 6d86fa9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the `CLA Signed` label Jul 10, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@github-actions commented:

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of your work and include it in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 14, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@cccclai (Contributor) left a comment:
Still reading, will finish reading in a bit

model.ar_len = model.max_seq_len
tokens, atten_mask = model.get_example_inputs(use_kv_cache=False)
atten_mask.to(torch.float)
print(atten_mask.shape)
Review comment: Removing debugging line

kv_quant_attrs=kv_quant_attrs,
),
)
# custom_annotations = custom_annotations + (
Review comment: Actually I need to have a separate PR for this.

@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 21, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@cccclai (Contributor) commented Jul 22, 2025

@winskuo-quic @haowhsu-quic Hi folks, we're landing this PR since it helps community users generate reasonable results. Once #12700 is tested and evaluated, we can switch over.

@cccclai (Contributor) left a comment:

Thanks for improving the accuracy on llama + htp!

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 22, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

facebook-github-bot merged commit e5e5dab into pytorch:main Jul 22, 2025
99 checks passed
Conarnar pushed a commit to Conarnar/executorch that referenced this pull request Jul 25, 2025
Differential Revision: D78127727

Pull Request resolved: pytorch#12377

Labels: CLA Signed, fb-exported