Add audio to multimodal runner #13662
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13662
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure: as of commit fb87bbf with merge base 99e6349, the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
extension/llm/runner/constants.h
inline constexpr auto kTokenEmbeddingMethod = "token_embeddings";
inline constexpr auto kTextModelMethod = "decoder";
Make it backwards compatible: keep token_embedding, not token_embeddings, and keep text_model instead of decoder.
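For reference, the suggestion amounts to keeping the original method-name strings, roughly like this (a sketch of the suggested change, not code from the PR):

```cpp
// Sketch of the reviewer's suggestion (not the PR's code): keep the original
// method-name strings instead of renaming them.
inline constexpr auto kTokenEmbeddingMethod = "token_embedding";  // not "token_embeddings"
inline constexpr auto kTextModelMethod = "text_model";            // not "decoder"
```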
There's nothing that needs to be kept backwards compatible; this isn't used anywhere at the moment. I'd like to match this to Optimum.
// 2. Run decoder model for prefill.
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
int64_t seq_len = encoder_output.toTensor().size(1);
std::vector<int64_t> cache_positions(seq_len);
for (int64_t i = 0; i < seq_len; ++i) {
  cache_positions[i] = start_pos + i;
}
auto cache_position_tensor = ::executorch::extension::from_blob(
    cache_positions.data(), {seq_len}, executorch::aten::ScalarType::Long);
auto prefill_result = module_->execute(
    kTextModelMethod, {cache_position_tensor, encoder_output});
if (prefill_result.error() != ::executorch::runtime::Error::Ok) {
  return ::executorch::runtime::Error::Internal;
}
auto prefill_outputs = prefill_result.get();
auto outputs_res = prefill_outputs[0].toTensor();
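As a quick, standalone sanity check of the index arithmetic above (hard-coded example values taken from the comment; independent of the runner):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Mirrors the loop above with the values from the comment:
  // start_pos = 2, encoder_output.size(1) = 5 -> positions [2, 3, 4, 5, 6].
  int64_t start_pos = 2;
  int64_t seq_len = 5;
  std::vector<int64_t> cache_positions(seq_len);
  for (int64_t i = 0; i < seq_len; ++i) {
    cache_positions[i] = start_pos + i;
  }
  for (int64_t p : cache_positions) {
    std::cout << p << ' ';  // prints: 2 3 4 5 6
  }
  std::cout << '\n';
  return 0;
}
```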
cc @kimishpatel @JacobSzwejbka: is this the correct way to manage KV cache indices?
  return prefill_result.error();
}
auto prefill_outputs = prefill_result.get();
auto outputs_res = prefill_outputs[0].toTensor();
Validate whether outputs_res.numel() == 0.
Why? I think adding so many validations for extremely unlikely outcomes makes things too long and hard to read. Letting this one naturally error out below and returning that error directly is good enough.
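For reference, the suggested guard would amount to a couple of extra lines like these (a sketch of the suggestion, not code from the PR; it reuses Error::Internal from the surrounding snippet):

```cpp
// Sketch of the suggested guard (not in the PR): bail out early if the
// prefill output tensor is empty instead of letting later code fail.
auto outputs_res = prefill_outputs[0].toTensor();
if (outputs_res.numel() == 0) {
  return ::executorch::runtime::Error::Internal;
}
```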
Consider adding unit tests similar to what Mengwei added in the vision-text version.
Also, the CI failures look legit.
Yeah, I'm fixing the CI issue.
Audio& get_audio() & {
  return std::get<Audio>(data_);
}
Is this needed? Do we ever return a mutable Audio?
Yeah, I was thinking the same. This follows the already-established pattern; I'm thinking of getting rid of all of these get_ variants later.
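For context, the established pattern being referenced is roughly the following std::variant accessor set (a sketch; apart from get_audio() and data_, the type names and variant alternatives here are assumptions):

```cpp
#include <string>
#include <utility>
#include <variant>

// Placeholder stand-ins for the runner's real Image/Audio types.
struct Image {};
struct Audio {};

// Sketch of the accessor pattern under discussion: one std::variant member
// with const, mutable-lvalue, and rvalue get_* overloads per alternative.
class MultimodalInput {
 public:
  explicit MultimodalInput(Audio audio) : data_(std::move(audio)) {}

  const Audio& get_audio() const& { return std::get<Audio>(data_); }
  // The mutable overload the review is questioning.
  Audio& get_audio() & { return std::get<Audio>(data_); }
  Audio&& get_audio() && { return std::get<Audio>(std::move(data_)); }

 private:
  std::variant<std::string, Image, Audio> data_;
};
```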
// 2. Run decoder model for prefill.
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
int64_t seq_len = encoder_output.toTensor().size(1);
Didn't vision-based multimodal need exactly the same thing?
Vision also takes this path.
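For context on how both modalities end up here, a rough dispatch sketch could look like the following (the encoder method names, the is_image() helper, and input_tensor are assumptions, not the PR's exact code; the error handling mirrors the quoted snippet):

```cpp
// Sketch only: pick the per-modality encoder, then hand its output to the
// shared decoder prefill quoted above (which builds cache_position from
// start_pos). This is why vision takes the same path as audio.
auto encoder_method =
    input.is_image() ? kVisionEncoderMethod : kAudioEncoderMethod;
auto encoder_result = module_->execute(encoder_method, {input_tensor});
if (encoder_result.error() != ::executorch::runtime::Error::Ok) {
  return encoder_result.error();
}
auto encoder_output = encoder_result.get()[0];
// ... shared prefill with cache_position follows, as shown earlier ...
```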
(Messed up the merge for the original stack; this is a reland. Original PR with comments here: #13662.) Differential Revision: [D81498750](https://our.internmc.facebook.com/intern/diff/D81498750)