Add a generic multimodal runner #13166

larryliu0820 · 2025-08-06T19:36:09Z

Summary:
This diff adds a generic multimodal runner for ExecuTorch. In image_prefiller.h, it adds a prefill method that takes an Image object and returns the LLM module's next token after prefill. In multimodal_runner.cpp, it implements the MultimodalRunner class for LLMs that take multimodal input and produce text output. MultimodalRunner uses the ImagePrefiller and TextPrefiller classes to prefill the model's KV cache, then uses TextTokenGenerator to run the autoregressive generation loop.
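
As a rough illustration of the prefill-then-decode flow described above, here is a minimal, self-contained sketch. Every type in it (`ToyTextPrefiller`, `ToyImagePrefiller`, `ToyTokenGenerator`, `ToyMultimodalRunner`) is a hypothetical stand-in invented for this example, not the code in this diff, and the "tokens" are plain counters rather than model outputs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; not the ExecuTorch classes.
struct Image {
  int width = 0;
  int height = 0;
};
using Input = std::variant<std::string, Image>;

// Toy prefillers: each advances the KV-cache position and returns a fake
// "next token". A real prefiller would run the model here.
struct ToyTextPrefiller {
  uint64_t prefill(const std::string& text, int64_t& pos) {
    pos += static_cast<int64_t>(text.size());  // pretend one char is one token
    return static_cast<uint64_t>(pos);
  }
};

struct ToyImagePrefiller {
  uint64_t prefill(const Image& image, int64_t& pos) {
    // Pretend one pixel occupies one cache slot.
    pos += static_cast<int64_t>(image.width) * image.height;
    return static_cast<uint64_t>(pos);
  }
};

// Toy generator: emits first_token, first_token + 1, ... until it has
// produced max_new_tokens tokens or has just emitted eos_id.
struct ToyTokenGenerator {
  uint64_t eos_id = 0;

  std::vector<uint64_t> generate(uint64_t first_token, int64_t& pos,
                                 int max_new_tokens) {
    std::vector<uint64_t> out;
    uint64_t cur = first_token;
    for (int i = 0; i < max_new_tokens; ++i) {
      out.push_back(cur);
      ++pos;
      if (cur == eos_id) {
        break;
      }
      cur += 1;  // stand-in for a decoder step plus logits-to-token sampling
    }
    return out;
  }
};

struct ToyMultimodalRunner {
  ToyTextPrefiller text_prefiller;
  ToyImagePrefiller image_prefiller;
  ToyTokenGenerator generator;
  int64_t pos = 0;

  std::vector<uint64_t> generate(const std::vector<Input>& inputs,
                                 int max_new_tokens) {
    assert(!inputs.empty());
    uint64_t next = 0;
    // Phase 1: prefill the cache with every input, in order.
    for (const Input& in : inputs) {
      if (const auto* text = std::get_if<std::string>(&in)) {
        next = text_prefiller.prefill(*text, pos);
      } else {
        next = image_prefiller.prefill(std::get<Image>(in), pos);
      }
    }
    // Phase 2: autoregressive decoding, seeded by the last prefill's token.
    return generator.generate(next, pos, max_new_tokens);
  }
};
```

The point of the split is that the runner itself only sequences the two phases and tracks the cache position; each component owns one job and could be swapped independently.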

See diagram:

      ┌─────────────────┐
      │     IRunner     │
      │   <<interface>> │
      │                 │
      │ + is_loaded()   │
      │ + load()        │
      │ + generate()    │
      │ + stop()        │
      └─────────────────┘
             △
             │
             │ implements
             │
             │
             │
             │
      ┌──────┴──────────┐          ┌─────────────────┐
      │ TextLLMRunner   │          │MultimodalRunner │
      │                 │          │                 │
      │ - tokenizer_    │          │ - tokenizer_    ┼───────┐
┌─────┼ - module_       │          │ - module_       ┼─────┐ │
│ ┌───┼ - stats_        │          │ - stats_        ┼───┐ │ │
│ │ ┌─┼ - metadata_     │          │ - metadata_     ┼─┐ │ │ │
│ │ │ │ - temperature_  │          │ - pos_          │ │ │ │ │
│ │ │ └─────────────────┘          └─────────────────┘ │ │ │ │
│ │ │                                                  │ │ │ │
│ │ │                                                  │ │ │ │
│ │ │                                                  │ │ │ │
│ │ │               ┌─────────────────┐                │ │ │ │
│ │ │               │TextTokenGenerat-│                │ │ │ │
│ │ │               │or               │                │ │ │ │
│ │ │               │                 │                │ │ │ │
│ │ │               │ - tokenizer_*   │                │ │ │ │
│ │ │  consists     │ - text_decoder_ │    consists    │ │ │ │
│ │ └──────────────►│   runner_       │◄───────────────┘ │ │ │
│ │                 │ - eos_ids_      │                  │ │ │
│ │                 │ - use_kv_cache_ │                  │ │ │
│ │                 │ - stats_*       │                  │ │ │
│ │                 │                 │                  │ │ │
│ │consists         │ + generate()    │         consists │ │ │
│ │                 └────────┬────────┘                  │ │ │
│ │           ┌──────────────┴───────────────┐           │ │ │
│ │           ▼            uses              ▼           │ │ │
│ │   ┌─────────────────┐          ┌─────────────────┐   │ │ │
│ │   │TextDecoderRunner│          │MultimodalTextDe-│   │ │ │
│ │   │                 │          │coderRunner      │   │ │ │
│ │   │ - module_*      │ extends  │ - module_*      │   │ │ │
│ └──►│ - should_stop_  │◄─────────┼ - should_stop_  │◄──┘ │ │
│     │                 │          │                 │     │ │
│     │ + step()        │          │ + step()        │     │ │
│     │ + logits_to_    │          │ + logits_to_    │     │ │
│     │   token()       │          │   token()       │     │ │
│     └─────────────────┘          └─────────────────┘     │ │
│             ▲                             ▲              │ │
│             │           uses              │              │ │
│             └──────────────┬──────────────┘              │ │
│                    ┌───────┴─────────┐                   │ │
│                    │  TextPrefiller  │                   │ │
│                    │                 │                   │ │
│                    │ - text_decoder_ │                   │ │
│   consists         │   runner_       │      consists     │ │
└───────────────────►│ - use_kv_cache_ │◄──────────────────┘ │
                     │ - enable_       │                     │
                     │   parallel_     │                     │
                     │   prefill_      │                     │
                     │                 │                     │
                     │ + prefill()     │                     │
                     └─────────────────┘           consists  │
                                                             │
                                                             │
                                   ┌─────────────────┐       │
                                   │ ImagePrefiller  │       │
                                   │                 │       │
                                   │ - module_*      │       │
                                   │                 │◄──────┘
                                   │ + prefill()     │
                                   │ + logits_to_    │
                                   │   token()       │
                                   └─────────────────┘
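
To make the top of the diagram concrete, the sketch below shows one way an interface with the four listed methods could look, plus a toy implementation. The names `IRunnerSketch` and `EchoRunner` and all signatures are assumptions made for this example; the actual declarations are the ones in this diff.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical interface mirroring the four methods named in the diagram.
// The signatures are assumptions, not the ones declared in this PR.
class IRunnerSketch {
 public:
  virtual ~IRunnerSketch() = default;
  virtual bool is_loaded() const = 0;
  virtual bool load() = 0;
  virtual bool generate(
      const std::string& prompt,
      const std::function<void(const std::string&)>& on_token) = 0;
  virtual void stop() = 0;
};

// Toy implementation that "generates" by echoing the prompt one character
// at a time through the callback.
class EchoRunner : public IRunnerSketch {
 public:
  bool is_loaded() const override { return loaded_; }

  bool load() override {
    loaded_ = true;
    return true;
  }

  bool generate(
      const std::string& prompt,
      const std::function<void(const std::string&)>& on_token) override {
    assert(static_cast<bool>(on_token));
    if (!loaded_ && !load()) {
      return false;  // lazy load on first use failed
    }
    should_stop_ = false;
    for (char c : prompt) {
      if (should_stop_) {
        break;  // stop() may have been called from inside the callback
      }
      on_token(std::string(1, c));
    }
    return true;
  }

  void stop() override { should_stop_ = true; }

 private:
  bool loaded_ = false;
  bool should_stop_ = false;
};
```

Callers that hold only a pointer to the interface can then drive a text-only or a multimodal runner the same way, which is the relationship the "implements" arrow in the diagram expresses.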

Differential Revision: D79231625

pytorch-bot · 2025-08-06T19:36:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13166

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Pending

As of commit 5c0ad9b with merge base d757709:

NEW FAILURE - The following job has failed:

Build documentation / build (buck2) / Build doc (gh)
At least one of the pre-conditions you specified did not hold

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-08-06T19:36:21Z

This pull request was exported from Phabricator. Differential Revision: D79231625

jackzhxng

The PR is not synced with the internal diff, so ignore these PR comments; I made some comments to address internally. Otherwise looks good.

Summary: This diff adds a generic multimodal runner for ExecuTorch. In `multimodal_prefiller.h`, it adds a `prefill` method that takes a `MultimodalInput` object and returns the LLM module's next token after prefill. In `multimodal_runner.cpp`, it implements the `MultimodalRunner` class for LLMs that take multimodal input and produce text output. `MultimodalRunner` uses the `MultimodalPrefiller` class to prefill the model's KV cache, then uses `TextTokenGenerator` to run the autoregressive generation loop.

[Diagram: class relationships among `IRunner`, `TextLLMRunner`, `MultimodalRunner`, `TextTokenGenerator`, `TextDecoderRunner`, `MultimodalDecoderRunner`, `TextPrefiller`, and `MultimodalPrefiller`. In this revision `MultimodalDecoderRunner` replaces `MultimodalTextDecoderRunner` and `MultimodalPrefiller` replaces `ImagePrefiller`.]

Reviewed By: jackzhxng

Differential Revision: D79231625
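
For readers unfamiliar with the pattern, a single input type that covers several modalities can be sketched with `std::variant`. The names below (`ImageSketch`, `MultimodalInputSketch`, `slots_needed`) and their members are assumptions for illustration and may not match `multimodal_input.h` in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical image payload; the field names are assumptions.
struct ImageSketch {
  std::vector<uint8_t> data;
  int32_t width = 0;
  int32_t height = 0;
  int32_t channels = 0;
};

// A tagged union over the supported modalities, sketched with std::variant.
class MultimodalInputSketch {
 public:
  explicit MultimodalInputSketch(std::string text) : data_(std::move(text)) {}
  explicit MultimodalInputSketch(ImageSketch image)
      : data_(std::move(image)) {}

  bool is_text() const { return std::holds_alternative<std::string>(data_); }
  bool is_image() const { return std::holds_alternative<ImageSketch>(data_); }

  const std::string& text() const {
    assert(is_text());
    return std::get<std::string>(data_);
  }

  const ImageSketch& image() const {
    assert(is_image());
    return std::get<ImageSketch>(data_);
  }

 private:
  std::variant<std::string, ImageSketch> data_;
};

// A single prefill entry point can branch on the modality. This toy version
// only reports how many cache slots the input would occupy, pretending one
// character or one pixel takes one slot.
inline int64_t slots_needed(const MultimodalInputSketch& input) {
  if (input.is_text()) {
    return static_cast<int64_t>(input.text().size());
  }
  const ImageSketch& img = input.image();
  return static_cast<int64_t>(img.width) * img.height;
}
```

With one input type, the runner can accept an ordered list of mixed inputs and hand each to the same prefill call, instead of exposing a separate method per modality.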

facebook-github-bot · 2025-08-18T23:09:04Z

This pull request was exported from Phabricator. Differential Revision: D79231625

facebook-github-bot · 2025-08-19T07:03:55Z

This pull request was exported from Phabricator. Differential Revision: D79231625

facebook-github-bot · 2025-08-19T07:42:52Z

This pull request was exported from Phabricator. Differential Revision: D79231625

facebook-github-bot · 2025-08-19T07:43:04Z

This pull request was exported from Phabricator. Differential Revision: D79231625

facebook-github-bot · 2025-08-19T07:43:10Z

This pull request was exported from Phabricator. Differential Revision: D79231625

mergennachin

Very nice!

mergennachin · 2025-08-19T15:34:04Z

We should make the Llava runner use this new API as a next step.

https://github.com/pytorch/executorch/tree/main/examples/models/llava/runner

facebook-github-bot · 2025-08-19T15:46:39Z

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this in D79231625.

facebook-github-bot · 2025-08-19T16:34:33Z

This pull request was exported from Phabricator. Differential Revision: D79231625

larryliu0820 requested review from jackzhxng, mergennachin and swolchok as code owners August 6, 2025 19:36

meta-cla bot added the CLA Signed label Aug 6, 2025

facebook-github-bot added the fb-exported label Aug 6, 2025

msluszniak reviewed Aug 8, 2025

extension/llm/runner/multimodal_input.h (outdated, resolved)

extension/llm/runner/multimodal_input.h (outdated, resolved)

extension/llm/runner/multimodal_runner.cpp (resolved)

jackzhxng approved these changes Aug 15, 2025

larryliu0820 requested a review from msluszniak August 16, 2025 00:22

msluszniak reviewed Aug 16, 2025

extension/llm/runner/multimodal_input.h (outdated, resolved)

extension/llm/runner/multimodal_input.h (resolved)

extension/llm/runner/multimodal_runner.cpp (outdated, resolved)

facebook-github-bot force-pushed the export-D79231625 branch from 7508ccb to 69e5c40 Compare August 18, 2025 23:08

larryliu0820 force-pushed the export-D79231625 branch from 69e5c40 to d7c4ad3 Compare August 19, 2025 06:59

larryliu0820 force-pushed the export-D79231625 branch from d7c4ad3 to bf0be56 Compare August 19, 2025 07:04

facebook-github-bot force-pushed the export-D79231625 branch from bf0be56 to 35f273b Compare August 19, 2025 07:34

jackzhxng force-pushed the export-D79231625 branch from 35f273b to 990bbfb Compare August 19, 2025 07:43

larryliu0820 force-pushed the export-D79231625 branch from 168ca09 to 93f1e3d Compare August 19, 2025 15:17

mergennachin approved these changes Aug 19, 2025

larryliu0820 force-pushed the export-D79231625 branch from 93f1e3d to 80a87b8 Compare August 19, 2025 16:30

larryliu0820 force-pushed the export-D79231625 branch from 80a87b8 to 5c0ad9b Compare August 19, 2025 16:34

larryliu0820 merged commit 83749ae into main Aug 19, 2025
104 of 106 checks passed

larryliu0820 deleted the export-D79231625 branch August 19, 2025 17:29

jackzhxng added the release notes: multimodal label Sep 4, 2025

6 participants
