
Conversation

@shewu-quic (Collaborator) commented Nov 8, 2024

summary:

  • Change the dtype of the token from int64 to int32, since int32 is QNN HTP-friendly. This allows the embedding op to hit backend optimizations that significantly speed up QNN embedding operations (see the sketch after this list).
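As an illustration of the change (not part of the PR diff), here is a minimal, self-contained sketch; the toy module, vocab size, and shapes below are placeholders, and only the token dtype switch reflects this PR:

```python
import torch

# Toy stand-in for the first op a token hits in the llama model: an embedding lookup.
class TinyTokenEmbedding(torch.nn.Module):
    def __init__(self, vocab_size: int = 128, dim: int = 16):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embedding(tokens)

model = TinyTokenEmbedding()

# Before: int64 tokens (torch.long), which do not match the QNN HTP
# backend's optimized embedding path.
tokens_int64 = torch.tensor([[1]], dtype=torch.long)

# After: int32 tokens, which are QNN HTP friendly and let the delegated
# embedding op take the optimized backend path.
tokens_int32 = torch.tensor([[1]], dtype=torch.int32)

print(model(tokens_int64).shape)  # torch.Size([1, 1, 16])
print(model(tokens_int32).shape)  # same result, int32 indices
```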

Test results for the PR (Llama 3.2 1B Instruct, seq_len=512, on SM8650): see attached image.
Test results for mainline: see attached image.

@pytorch-bot bot commented Nov 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6725

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5aac8b6 with merge base 39e5b91:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 8, 2024
@shewu-quic (Collaborator, Author) commented Nov 8, 2024

Hi @cccclai,
This PR changes the dtype of the token to optimize the embedding op. Could you help take a look?
After that, I will create the following PRs, which speed up Llama 3.2 1B/3B with the QNN backend, as soon as possible:

  1. Delegate mutable buffer
  2. Enable llama3.2 on static llama

Thank you for your effort.

@facebook-github-bot (Contributor)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

 return (
     torch.tensor(
-        [[1]], dtype=torch.long
+        [[1]], dtype=torch.int32
Contributor

Can we put it in model metadata and read it at runtime to avoid misuse?

Collaborator Author

Oops, I missed this change. Yes, I only intend to change it for the QNN backend.
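For reference, a minimal sketch of the suggestion above, assuming a hypothetical metadata dict written at export time and read back by the runner; the use_int32_token key and the make_token_tensor helper are illustrative placeholders, not existing ExecuTorch APIs:

```python
import torch

# Hypothetical export-side metadata recording the token dtype the model expects.
# Shipping this with the model lets the runtime pick the dtype instead of guessing.
model_metadata = {
    "use_int32_token": True,  # illustrative key, not an ExecuTorch constant
}

def make_token_tensor(token_id: int, metadata: dict) -> torch.Tensor:
    """Illustrative runtime-side helper: choose the token dtype from metadata."""
    dtype = torch.int32 if metadata.get("use_int32_token", False) else torch.int64
    return torch.tensor([[token_id]], dtype=dtype)

tokens = make_token_tensor(1, model_metadata)
assert tokens.dtype == torch.int32  # matches what the backend was exported for
```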

@cccclai (Contributor) commented Nov 9, 2024

Hi, this PR breaks quite a few internal tests. Since the tests are internal, I need to send you the fix patch before landing it.

summary:
- Change the dtype of the token from int64 to int32.
int32 is QNN HTP-friendly; it significantly speeds up QNN embedding operations by matching backend optimizations.
@shewu-quic force-pushed the dev1/hutton/optimize_qnn_embedding_op_in_llama branch from ddade6d to 73e2f7c on November 10, 2024 12:26
@facebook-github-bot (Contributor)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

TextTokenGenerator(
    Tokenizer* tokenizer,
    TextDecoderRunner* text_decoder_runner,
    bool use_int32_token,
Contributor

This is BC-breaking and causes a CI failure. Can we set a default value so it's BC-compatible? Also, assert when it doesn't match the metadata.

Contributor

The CI-breaking stack trace can be found here.

@facebook-github-bot (Contributor)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


TextPrefiller::TextPrefiller(
    TextDecoderRunner* text_decoder_runner,
    bool use_int32_token,
Contributor

Oh, I meant more like:

TextPrefiller::TextPrefiller(
    TextDecoderRunner* text_decoder_runner,
    bool use_kv_cache,
    bool enable_parallel_prefill,
    bool use_int32_token = false)

so we don't need to update all call sites...

Collaborator Author

Sorry about the misunderstanding. I have updated it. Thanks!

@shewu-quic force-pushed the dev1/hutton/optimize_qnn_embedding_op_in_llama branch from 8cbb91d to 5aac8b6 on November 11, 2024 07:13
@facebook-github-bot (Contributor)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment from @facebook-github-bot.

@cccclai (Contributor) commented Feb 7, 2025

Is this PR still needed?

@shewu-quic (Collaborator, Author) replied:

> Is this PR still needed?

I think we are focusing on static llama, so maybe we can close this PR.

@shewu-quic closed this Feb 7, 2025