Qualcomm AI Engine Direct - Optimize QNN embedding op for llama #6725
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6725
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5aac8b6 with merge base 39e5b91.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai,
Thank you for your effort.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
examples/models/llama/model.py
Outdated
  return (
      torch.tensor(
-         [[1]], dtype=torch.long
+         [[1]], dtype=torch.int32
Can we put it in the model metadata and read it at runtime to avoid misuse?
Oops, I missed this change. Yes, I would just like to change it for the QNN backend.
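As context for the metadata suggestion above, here is a minimal sketch of what reading such a flag at runtime could look like. The ModelMetadata map and the "use_int32_token" key are assumptions for illustration, not the actual ExecuTorch metadata API.

// Hypothetical sketch: resolve the token dtype from exported model metadata
// at runtime so the runner and the exported graph cannot disagree.
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for whatever key/value metadata the loaded program exposes.
using ModelMetadata = std::unordered_map<std::string, int64_t>;

// Returns true if the model was exported to take int32 token ids.
// Defaults to false (int64) when the key is absent, matching older models.
inline bool use_int32_token(const ModelMetadata& metadata) {
  const auto it = metadata.find("use_int32_token");
  return it != metadata.end() && it->second != 0;
}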
Hi, this PR breaks quite a few internal tests. Since the tests are internal, I need to send you the fix patch before landing it.
Summary:
- Change the dtype of the token from int64 to int32. Int32 is QNN HTP friendly and significantly speeds up the QNN embedding op because it matches the backend's optimizations.
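For illustration only (not code from this PR): the runner-side effect of the dtype change is essentially a lossless narrowing of the token ids before they reach the embedding op, since vocabulary ids fit comfortably in int32. The helper below is hypothetical.

// Illustrative only: narrow int64 token ids to int32 so the QNN HTP backend
// can consume them directly, avoiding an extra cast inside the graph.
#include <cstdint>
#include <vector>

inline std::vector<int32_t> to_int32_tokens(const std::vector<int64_t>& tokens) {
  std::vector<int32_t> out;
  out.reserve(tokens.size());
  for (const int64_t t : tokens) {
    out.push_back(static_cast<int32_t>(t));  // vocab ids are far below INT32_MAX
  }
  return out;
}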
Force-pushed from ddade6d to 73e2f7c
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
  TextTokenGenerator(
      Tokenizer* tokenizer,
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token,
This is BC-breaking and causes CI failures. Can we set a default value so it stays BC compatible? Also assert when it doesn't match the metadata.
The stack trace for the CI breakage can be found here.
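A sketch of the backward-compatible shape being requested here, assuming a defaulted trailing parameter plus a metadata cross-check. Only the names Tokenizer*, TextDecoderRunner*, and use_int32_token come from the diff above; the rest (members, check_against_metadata) is illustrative.

// Sketch: give the new flag a default so existing call sites keep compiling,
// and cross-check it against the exported metadata to catch mismatches early.
#include <cassert>

class Tokenizer;          // assumed forward declarations; the real types
class TextDecoderRunner;  // live in the llama runner sources

class TextTokenGenerator {
 public:
  TextTokenGenerator(
      Tokenizer* tokenizer,
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token = false)  // default keeps the old int64 behavior
      : tokenizer_(tokenizer),
        text_decoder_runner_(text_decoder_runner),
        use_int32_token_(use_int32_token) {}

  // Hypothetical cross-check against exported model metadata to avoid misuse.
  void check_against_metadata(bool metadata_use_int32_token) const {
    assert(use_int32_token_ == metadata_use_int32_token);
  }

 private:
  Tokenizer* tokenizer_;
  TextDecoderRunner* text_decoder_runner_;
  bool use_int32_token_;
};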
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
  TextPrefiller::TextPrefiller(
      TextDecoderRunner* text_decoder_runner,
      bool use_int32_token,
Oh, I meant more like

TextPrefiller::TextPrefiller(
    TextDecoderRunner* text_decoder_runner,
    bool use_kv_cache,
    bool enable_parallel_prefill,
    bool use_int32_token = false)

so we don't need to update all the call sites.
Sorry about the misunderstanding. I have updated it. Thanks!
Force-pushed from 8cbb91d to 5aac8b6
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Is this PR still needed?
I think we are focusing on static llama, so maybe we can close this PR.
Summary:
Test the PR for Llama 3.2 1B Instruct with seq_len=512 on SM8650:
[benchmark screenshots]
Test the mainline