Qualcomm AI Engine Direct - Support MaskedSoftmax in static llama #12745
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12745
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit cf02afa with merge base 45846c8:
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai, this PR enables the new "MaskedSoftmax" feature for LLMs on the HTP backend. MaskedSoftmax replaces specific patterns, such as the Softmax(Add(In, Mask)) structure, and is supported starting from QNN 2.35. Since this is a backend optimization, we need to check the optrace to confirm it is successfully enabled. I have evaluated its performance on story llama and Llama 3.2 1B/3B, and it shows a slight improvement in TTFT and token generation rate.
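For context, here is a minimal sketch of the attention-score pattern this optimization targets. The tensor names and shapes are illustrative, not taken from the PR; only the add + softmax structure itself comes from the description above.

```python
import torch
import torch.nn.functional as F

def masked_attention_scores(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """The Softmax(Add(In, Mask)) pattern as it appears in an attention block.

    `scores` are the raw QK^T attention logits and `mask` is the additive
    attention mask (large negative values at masked positions). On QNN >= 2.35
    the backend can replace this add + softmax pair with a MaskedSoftmax op.
    """
    return F.softmax(scores + mask, dim=-1)

# Illustrative shapes: (batch, heads, seq_len, seq_len) with a causal mask.
scores = torch.randn(1, 8, 128, 128)
mask = torch.triu(torch.full((128, 128), float("-inf")), diagonal=1)
probs = masked_attention_scores(scores, mask)
```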
```python
raise RuntimeError(f"Using an unknown kv update {args.kv_updater}")
```

```python
if args.enable_masked_softmax and is_qnn_sdk_version_less_than("2.35"):
    logging.warning(
```
Great! Is there a way to query the QNN SDK version? Just in case of weird behavior if users don't notice the warning.
Summary:
- Add a unit test for masked softmax.
- Add amin op support.
- Add a flag `--enable_masked_softmax` to enable the masked softmax feature. It is designed to optimize the accuracy and performance of LLMs executed on the HTP backend. MaskedSoftmax replaces the Softmax(Add(In, Mask)) structure in the attention block of LLMs during backend optimization. For more details, please refer to the QNN documentation. Note that it is only supported starting from QNN 2.35.
Force-pushed 2c88338 to cf02afa
Hi @cccclai, I have updated the API to get the SDK version. Does it look good now?
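For reference, a version-comparison helper along these lines could look like the sketch below. `is_qnn_sdk_version_less_than` is the name that appears in the diff above, but its body, and the hypothetical `get_qnn_sdk_version()` used to read the installed SDK version, are assumptions for illustration and not taken from this PR.

```python
# Hypothetical sketch; get_qnn_sdk_version() is an assumed helper that
# returns the installed QNN SDK version as a string, e.g. "2.34.0".
def is_qnn_sdk_version_less_than(target: str) -> bool:
    installed = [int(p) for p in get_qnn_sdk_version().split(".")]
    wanted = [int(p) for p in target.split(".")]
    # Pad with zeros so "2.35" compares cleanly against "2.35.0".
    width = max(len(installed), len(wanted))
    installed += [0] * (width - len(installed))
    wanted += [0] * (width - len(wanted))
    return installed < wanted
```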
cccclai
left a comment
This looks great! Thank you


cc: @haowhsu-quic, @winskuo-quic