Qualcomm AI Engine Direct - Optimize QNN embedding op for llama #6725
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6725
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5aac8b6 with merge base 39e5b91.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai,
Thank you for your effort.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
examples/models/llama/model.py
Outdated
  return (
      torch.tensor(
-         [[1]], dtype=torch.long
+         [[1]], dtype=torch.int32
Can we put it in the model metadata and read it at runtime to avoid misuse?
Oops, I missed this change. Yes, I would just like to change it for the QNN backend.
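As context for the metadata suggestion above, here is a minimal sketch of what reading such a flag at runtime could look like. The ModelMetadata map and the "use_int32_token" key are assumptions for illustration, not the actual ExecuTorch metadata API.

// Hypothetical sketch: resolve the token dtype from exported model metadata
// at runtime so the runner and the exported graph cannot disagree.
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for whatever key/value metadata the loaded program exposes.
using ModelMetadata = std::unordered_map<std::string, int64_t>;

// Returns true if the model was exported to take int32 token ids.
// Defaults to false (int64) when the key is absent, matching older models.
inline bool use_int32_token(const ModelMetadata& metadata) {
  const auto it = metadata.find("use_int32_token");
  return it != metadata.end() && it->second != 0;
}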
Hi, this PR breaks quite a few internal tests. Since the tests are internal, I need to send you the fix patch before landing it.
Summary:
- Change the dtype of the token from int64 to int32. Int32 is QNN HTP friendly and significantly speeds up the QNN embedding op because it matches the backend's optimizations.
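For illustration only (not code from this PR): the runner-side effect of the dtype change is essentially a lossless narrowing of the token ids before they reach the embedding op, since vocabulary ids fit comfortably in int32. The helper below is hypothetical.

// Illustrative only: narrow int64 token ids to int32 so the QNN HTP backend
// can consume them directly, avoiding an extra cast inside the graph.
#include <cstdint>
#include <vector>

inline std::vector<int32_t> to_int32_tokens(const std::vector<int64_t>& tokens) {
  std::vector<int32_t> out;
  out.reserve(tokens.size());
  for (const int64_t t : tokens) {
    out.push_back(static_cast<int32_t>(t));  // vocab ids are far below INT32_MAX
  }
  return out;
}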
Force-pushed from ddade6d to 73e2f7c
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
  TextTokenGenerator(
      Tokenizer* tokenizer,
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token,
This is BC-breaking and causes CI failures. Can we set a default value so it stays BC compatible? Also assert when it doesn't match the metadata.
The stack trace for the CI breakage can be found here.
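A sketch of the backward-compatible shape being requested here, assuming a defaulted trailing parameter plus a metadata cross-check. Only the names Tokenizer*, TextDecoderRunner*, and use_int32_token come from the diff above; the rest (members, check_against_metadata) is illustrative.

// Sketch: give the new flag a default so existing call sites keep compiling,
// and cross-check it against the exported metadata to catch mismatches early.
#include <cassert>

class Tokenizer;          // assumed forward declarations; the real types
class TextDecoderRunner;  // live in the llama runner sources

class TextTokenGenerator {
 public:
  TextTokenGenerator(
      Tokenizer* tokenizer,
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token = false)  // default keeps the old int64 behavior
      : tokenizer_(tokenizer),
        text_decoder_runner_(text_decoder_runner),
        use_int32_token_(use_int32_token) {}

  // Hypothetical cross-check against exported model metadata to avoid misuse.
  void check_against_metadata(bool metadata_use_int32_token) const {
    assert(use_int32_token_ == metadata_use_int32_token);
  }

 private:
  Tokenizer* tokenizer_;
  TextDecoderRunner* text_decoder_runner_;
  bool use_int32_token_;
};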
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
  TextPrefiller::TextPrefiller(
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token,
Oh, I meant more like

TextPrefiller::TextPrefiller(
    TextDecoderRunner* text_decoder_runner,
    bool use_kv_cache,
    bool enable_parallel_prefill,
    bool use_int32_token = false)

so we don't need to update all the call sites.
Sorry about the misunderstanding. I have updated it. Thanks!
Force-pushed from 8cbb91d to 5aac8b6
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Is this PR still needed?
I think we are focusing on static llama, so maybe we can close this PR.
Summary:
Test the PR for Llama 3.2 1B Instruct with seq_len=512 on SM8650:
[benchmark screenshots]
Test the mainline