
Conversation

@rohansjoshi (Contributor) commented Jul 10, 2025

Summary:
`llama.py` now has a `--range_setting` flag with two options: `mse_weight_only` and `mse_with_act_loss`. There is also an eval script for computing perplexity, `eval_llama_qnn.py` (for faster eval, try a sequence length of 1024). The eval script additionally has a `--quant_linear_only` flag that quantizes only linear/conv nodes, for running faster experiments.
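For context, MSE-based range setting picks quantization parameters by searching for the clipping range that minimizes reconstruction error, rather than using the tensor's raw min/max. Below is a minimal, generic sketch of the weight-only idea; this is not the PR's implementation, and the function name, grid size, and symmetric per-tensor scheme are illustrative assumptions. The `mse_with_act_loss` option presumably measures the error on layer outputs instead of on the weights directly.

```
import torch

def mse_weight_scale(w: torch.Tensor, n_bits: int = 4, n_steps: int = 100) -> torch.Tensor:
    """Hypothetical sketch: pick a symmetric per-tensor scale for `w` by minimizing quantization MSE."""
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 7 for 4-bit signed weights
    w_absmax = w.abs().max()
    if w_absmax == 0:
        return torch.tensor(1.0)           # degenerate all-zero tensor
    best_scale = w_absmax / qmax           # min/max baseline
    best_err = float("inf")
    for i in range(1, n_steps + 1):
        clip = w_absmax * i / n_steps      # candidate clipping threshold
        scale = clip / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)  # fake-quantize
        err = torch.mean((q * scale - w) ** 2).item()             # reconstruction MSE
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```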

Commands:

python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss

python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
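For quicker experiments, the knobs mentioned in the summary can be combined. Assuming the sequence-length suggestion maps to `--max_seq_length` and the flags compose independently (not verified here), a faster eval run might look like:

python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 1024 --ptq 16a4w --range_setting mse_with_act_loss --quant_linear_only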

Update: I've also added SpinQuant as a feature to further improve accuracy. Both `llama.py` and `eval_llama_qnn.py` now have a `--spinquant` flag, which can be used in combination with range setting or by itself. Based on my experiments on Llama 1B with `16a4w`, I get the best results using both `--spinquant` and `--range_setting mse_with_act_loss`.
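For reference, SpinQuant inserts rotation matrices that smooth out weight and activation outliers before quantization. Assuming `--spinquant` composes with the other flags as described above, the best-performing configuration from those experiments would be invoked as:

python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss --spinquant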

Rollback Plan:

Differential Revision: D78127727

rohansjoshi requested a review from cccclai as a code owner July 10, 2025 23:02
pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12377

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 72da5c1 with merge base 6d86fa9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the `CLA Signed` label Jul 10, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@github-actions commented:

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of your work and include it in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 14, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@cccclai (Contributor) left a comment:
Still reading, will finish reading in a bit

model.ar_len = model.max_seq_len
tokens, atten_mask = model.get_example_inputs(use_kv_cache=False)
atten_mask.to(torch.float)
print(atten_mask.shape)
Review comment: Removing debugging line

kv_quant_attrs=kv_quant_attrs,
),
)
# custom_annotations = custom_annotations + (
Review comment: Actually I need to have a separate PR for this.

@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 21, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

@cccclai (Contributor) commented Jul 22, 2025

@winskuo-quic @haowhsu-quic Hi folks, we're landing this PR since it helps community users generate reasonable results. Once #12700 is tested and evaluated, we can switch over.

@cccclai (Contributor) left a comment:

Thanks for improving the accuracy on llama + htp!

rohansjoshi added a commit to rohansjoshi/executorch that referenced this pull request Jul 22, 2025
@facebook-github-bot: This pull request was exported from Phabricator. Differential Revision: D78127727

facebook-github-bot merged commit e5e5dab into pytorch:main Jul 22, 2025
99 checks passed
Conarnar pushed a commit to Conarnar/executorch that referenced this pull request Jul 25, 2025
Differential Revision: D78127727

Pull Request resolved: pytorch#12377

Labels: CLA Signed, fb-exported