Implemented range setting in QNN llama flow #12377
This pull request was exported from Phabricator. Differential Revision: D78127727
Summary:
`llama.py` now has a `--range_setting` flag with two options, `mse_weight_only` and `mse_with_act_loss`. There is also a new eval script, `eval_llama_qnn.py`, that computes perplexity (for faster eval, try seq length 1024). The script also has a `--quant_linear_only` flag that quantizes only linear/conv nodes, which makes experiments run faster.
Commands:
```
python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
```
```
python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
```
Rollback Plan:
Differential Revision: D78127727
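For readers unfamiliar with MSE-based range setting, the general idea can be sketched as follows. This is only an illustration of the technique, not the code in this PR: the function names (`fake_quantize`, `search_mse_range`), the per-tensor symmetric scheme, and the grid search are all assumptions made for the example, and the activation-aware `mse_with_act_loss` variant is not covered.

```python
def fake_quantize(values, max_abs, num_bits=4):
    """Symmetric fake quantization: snap to a signed integer grid, then rescale."""
    qmax = 2 ** (num_bits - 1) - 1      # 7 for 4-bit weights
    qmin = -(2 ** (num_bits - 1))       # -8 for 4-bit weights
    scale = max_abs / qmax
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # round, then clip to the grid
        out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_mse_range(weights, num_bits=4, num_steps=100):
    """Grid-search the clipping threshold that minimizes weight reconstruction MSE.

    Hypothetical helper: starts from the plain min-max range and keeps any
    tighter threshold that gives a lower error.
    """
    full_range = max(abs(w) for w in weights)
    if full_range == 0:
        return 0.0, 0.0
    best_max = full_range
    best_err = mse(weights, fake_quantize(weights, full_range, num_bits))
    for step in range(1, num_steps):
        candidate = full_range * step / num_steps
        err = mse(weights, fake_quantize(weights, candidate, num_bits))
        if err < best_err:
            best_max, best_err = candidate, err
    return best_max, best_err


# Example: mostly small weights plus one large outlier.
example_weights = [i / 10 for i in range(-10, 11)] + [8.0]
threshold, error = search_mse_range(example_weights)
print(f"chosen max_abs={threshold:.3f}, mse={error:.5f}")
```

Because the search starts from the min-max range, the chosen threshold can never give a higher weight MSE than plain min-max quantization on the same tensor.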
a457091 to d55c96d
Still reading, will finish reading in a bit
model.ar_len = model.max_seq_len
tokens, atten_mask = model.get_example_inputs(use_kv_cache=False)
atten_mask.to(torch.float)
print(atten_mask.shape)
Removing debugging line
kv_quant_attrs=kv_quant_attrs,
),
)
# custom_annotations = custom_annotations + (
Actually I need to have a separate PR for this.
d55c96d to 0f86944
0f86944 to 4e19891
Summary: `llama.py` now has a `--range_setting` flag with two options, `mse_weight_only` and `mse_with_act_loss`. There is also a new eval script, `eval_llama_qnn.py`, that computes perplexity (for faster eval, try seq length 1024). The script also has a `--quant_linear_only` flag that quantizes only linear/conv nodes, which makes experiments run faster.
Reviewed By: cccclai
Differential Revision: D78127727
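As background on the metric the eval script reports, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that definition; the function name and the list-of-log-probabilities input are assumptions for illustration, and this is not how `eval_llama_qnn.py` is implemented.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every reference token
# is as uncertain as a uniform choice among four tokens.
print(perplexity([math.log(0.25)] * 4))
```

Lower is better: a quantized model whose perplexity stays close to the floating-point baseline has lost little accuracy.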
@winskuo-quic @haowhsu-quic Hi folks, we're landing this PR because it can help community users generate reasonable results. Once #12700 is tested and evaluated, we can switch over.
Thanks for improving the accuracy on llama + htp!
4e19891 to 5d1d4ae
Summary: `llama.py` now has a `--range_setting` flag with two options, `mse_weight_only` and `mse_with_act_loss`. There is also a new eval script, `eval_llama_qnn.py`, that computes perplexity (for faster eval, try seq length 1024). The script also has a `--quant_linear_only` flag that quantizes only linear/conv nodes, which makes experiments run faster.
Update: I've also added SpinQuant as a feature to further improve accuracy. Both `llama.py` and `eval_llama_qnn.py` have a `--spinquant` flag, which can be used in combination with range setting or by itself. Based on my experiments on Llama 1B, I get the best results using both `--spinquant` and `--range_setting mse_with_act_loss`.
Reviewed By: cccclai
Differential Revision: D78127727
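The reason a rotation can help quantization without changing the model's output can be sketched with a toy example. For an orthogonal matrix R, (x R)(Rᵀ W) = x W, so rotating activations and counter-rotating weights leaves the result unchanged while spreading outlier channels across all channels. The sketch below swaps in a fixed Hadamard rotation for illustration; it is not the SpinQuant implementation in this PR, and all helper names and values are hypothetical.

```python
import math


def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


x = [[0.1, -0.2, 8.0, 0.05]]                               # one outlier channel
w = [[0.5, -1.0], [0.25, 0.75], [-0.3, 0.2], [1.0, 0.4]]   # toy weight matrix
r = hadamard(4)

x_rot = matmul(x, r)               # rotated activations
w_rot = matmul(transpose(r), w)    # counter-rotated weights
y_ref = matmul(x, w)
y_rot = matmul(x_rot, w_rot)

print("output unchanged:", all(abs(a - b) < 1e-9 for a, b in zip(y_ref[0], y_rot[0])))
print("max |x| before:", max(abs(v) for v in x[0]), "after:", max(abs(v) for v in x_rot[0]))
```

After rotation the largest activation magnitude is smaller than the original outlier, so the same number of quantization levels covers a narrower range.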
5d1d4ae to 25d557f
25d557f to 72da5c1
    Differential Revision: D78127727 Pull Request resolved: pytorch#12377
Summary:
`llama.py` now has the `--range_setting` flag, for which there are the options `mse_weight_only` and `mse_with_act_loss`. There is also an eval script for computing perplexity called `eval_llama_qnn.py` (for faster eval, try seq length 1024). This script also has a flag `--quant_linear_only` to only quantize linear/conv nodes, to run faster experiments.
Commands:
`python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss`
`python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss`
Update: I've also added SpinQuant as a feature to further improve accuracy. Both `llama.py` and `eval_llama_qnn.py` also have a flag `--spinquant` which can be used in combination with range setting or by itself. Based on my experiments on Llama 1B for 16a4w, I get the best results using both `--spinquant` and `--range_setting mse_with_act_loss`.
Rollback Plan:
Differential Revision: D78127727