Add audio to multimodal runner #13662
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13662
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure: as of commit fb87bbf with merge base 99e6349, the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
extension/llm/runner/constants.h
inline constexpr auto kTokenEmbeddingMethod = "token_embeddings";
inline constexpr auto kTextModelMethod = "decoder";
Make it backwards compatible: keep token_embedding, not token_embeddings, and keep text_model instead of decoder.
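For reference, the suggestion amounts to keeping the original method-name strings, roughly like this (a sketch of the suggested change, not code from the PR):

```cpp
// Sketch of the reviewer's suggestion (not the PR's code): keep the original
// method-name strings instead of renaming them.
inline constexpr auto kTokenEmbeddingMethod = "token_embedding";  // not "token_embeddings"
inline constexpr auto kTextModelMethod = "text_model";            // not "decoder"
```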
There's nothing that needs to be kept backwards compatible; this isn't used anywhere at the moment. I'd like to match this to Optimum.
// 2. Run decoder model for prefill.
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
int64_t seq_len = encoder_output.toTensor().size(1);
std::vector<int64_t> cache_positions(seq_len);
for (int64_t i = 0; i < seq_len; ++i) {
  cache_positions[i] = start_pos + i;
}
auto cache_position_tensor = ::executorch::extension::from_blob(
    cache_positions.data(), {seq_len}, executorch::aten::ScalarType::Long);
auto prefill_result = module_->execute(
    kTextModelMethod, {cache_position_tensor, encoder_output});
if (prefill_result.error() != ::executorch::runtime::Error::Ok) {
  return ::executorch::runtime::Error::Internal;
}
auto prefill_outputs = prefill_result.get();
auto outputs_res = prefill_outputs[0].toTensor();
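As a quick, standalone sanity check of the index arithmetic above (hard-coded example values taken from the comment; independent of the runner):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Mirrors the loop above with the values from the comment:
  // start_pos = 2, encoder_output.size(1) = 5 -> positions [2, 3, 4, 5, 6].
  int64_t start_pos = 2;
  int64_t seq_len = 5;
  std::vector<int64_t> cache_positions(seq_len);
  for (int64_t i = 0; i < seq_len; ++i) {
    cache_positions[i] = start_pos + i;
  }
  for (int64_t p : cache_positions) {
    std::cout << p << ' ';  // prints: 2 3 4 5 6
  }
  std::cout << '\n';
  return 0;
}
```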
cc @kimishpatel @JacobSzwejbka: is this the correct way to manage KV cache indices?
  return prefill_result.error();
}
auto prefill_outputs = prefill_result.get();
auto outputs_res = prefill_outputs[0].toTensor();
Validate whether outputs_res.numel() == 0.
Why? I think adding so many validations for extremely unlikely outcomes makes things too long and hard to read. Letting this one naturally error out below and returning that error directly is good enough.
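For reference, the suggested guard would amount to a couple of extra lines like these (a sketch of the suggestion, not code from the PR; it reuses Error::Internal from the surrounding snippet):

```cpp
// Sketch of the suggested guard (not in the PR): bail out early if the
// prefill output tensor is empty instead of letting later code fail.
auto outputs_res = prefill_outputs[0].toTensor();
if (outputs_res.numel() == 0) {
  return ::executorch::runtime::Error::Internal;
}
```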
Consider adding unit tests similar to what Mengwei added in the vision-text version.
Also, the CI failures look legit.
Yeah, I'm fixing the CI issue.
Audio& get_audio() & {
  return std::get<Audio>(data_);
}
Is this needed? Do we ever return a mutable Audio?
Yeah, I was thinking the same. This follows the already-established pattern; I'm thinking of getting rid of all of these get_ variants later.
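For context, the established pattern being referenced is roughly the following std::variant accessor set (a sketch; apart from get_audio() and data_, the type names and variant alternatives here are assumptions):

```cpp
#include <string>
#include <utility>
#include <variant>

// Placeholder stand-ins for the runner's real Image/Audio types.
struct Image {};
struct Audio {};

// Sketch of the accessor pattern under discussion: one std::variant member
// with const, mutable-lvalue, and rvalue get_* overloads per alternative.
class MultimodalInput {
 public:
  explicit MultimodalInput(Audio audio) : data_(std::move(audio)) {}

  const Audio& get_audio() const& { return std::get<Audio>(data_); }
  // The mutable overload the review is questioning.
  Audio& get_audio() & { return std::get<Audio>(data_); }
  Audio&& get_audio() && { return std::get<Audio>(std::move(data_)); }

 private:
  std::variant<std::string, Image, Audio> data_;
};
```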
// 2. Run decoder model for prefill.
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
int64_t seq_len = encoder_output.toTensor().size(1);
Didn't vision-based multimodal need exactly the same thing?
Vision also takes this path.
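For context on how both modalities end up here, a rough dispatch sketch could look like the following (the encoder method names, the is_image() helper, and input_tensor are assumptions, not the PR's exact code; the error handling mirrors the quoted snippet):

```cpp
// Sketch only: pick the per-modality encoder, then hand its output to the
// shared decoder prefill quoted above (which builds cache_position from
// start_pos). This is why vision takes the same path as audio.
auto encoder_method =
    input.is_image() ? kVisionEncoderMethod : kAudioEncoderMethod;
auto encoder_result = module_->execute(encoder_method, {input_tensor});
if (encoder_result.error() != ::executorch::runtime::Error::Ok) {
  return encoder_result.error();
}
auto encoder_output = encoder_result.get()[0];
// ... shared prefill with cache_position follows, as shown earlier ...
```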
(Messed up the merge for the original stack; this is a reland. Original PR with comments here: #13662.) Differential Revision: [D81498750](https://our.internmc.facebook.com/intern/diff/D81498750)