Add Voxtral runner #13663
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13663

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f9c1771 with merge base 99e6349: NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
### Summary

Use `multimodal_runner.h` to run [Voxtral exported from Optimum Executorch](huggingface/optimum-executorch#126). The runner takes in a `.pt` file of a preprocessed audio recording and feeds it to a C++ multimodal runner.

Example output:

```
This audio is a casual and somewhat silly conversation between two speakers who seem to be discussing their tattoos. The speakers are engaging in a game where they ask each other what their tattoos say, but both repeatedly say "sweet" instead of the actual words. The speakers are aware of their mistake and try to correct it by asking the other what their tattoo says, but they still end up saying "sweet" again. The conversation ends with a speaker telling the other that their tattoo says "

PyTorchObserver {"prompt_tokens":1138,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1756159197436,"inference_end_ms":1756159222710,"prompt_eval_end_ms":1756159209605,"first_token_ms":1756159209605,"aggregate_sampling_time_ms":96,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:33.116291 executorch:stats.h:104] Prompt Tokens: 1138 Generated Tokens: 99
I 00:00:33.116304 executorch:stats.h:110] Model Load Time: 0.000000 (seconds)
I 00:00:33.116312 executorch:stats.h:117] Total inference time: 25.274000 (seconds) Rate: 3.917069 (tokens/second)
I 00:00:33.116320 executorch:stats.h:127] Prompt evaluation: 12.169000 (seconds) Rate: 93.516312 (tokens/second)
I 00:00:33.116327 executorch:stats.h:136] Generated 99 tokens: 13.105000 (seconds) Rate: 7.554369 (tokens/second)
I 00:00:33.116338 executorch:stats.h:147] Time to first generated token: 12.169000 (seconds)
I 00:00:33.116344 executorch:stats.h:153] Sampling time over 1237 tokens: 0.096000 (seconds)
```

### Test plan

Build and run:

```
# Build and install ExecuTorch
cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out -DEXECUTORCH_ENABLE_LOGGING=ON && cmake --build cmake-out -j16 --target install --config Release

# Build and install Voxtral runner
cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -Bcmake-out/examples/models/voxtral examples/models/voxtral && cmake --build cmake-out/examples/models/voxtral -j16 --config Release

# Run Voxtral runner
./cmake-out/examples/models/voxtral/voxtral_runner \
  --model_path ~/models/voxtral/voxtral_q8da4w_edm_qe4w_d_split_metadata_unsqueeze.pte \
  --tokenizer_path ~/hf/models--mistralai--Voxtral-Mini-3B-2507/snapshots/3060fe34b35ba5d44202ce9ff3c097642914f8f3/tekken.json \
  --prompt "What can you tell me about this audio?" \
  --audio_path ~/models/voxtral/input_features.bin
```
        size_t bos_token_index,
        size_t eos_token_index) {
      runtime::runtime_init();
      auto tekken_tokenizer = std::make_unique<tokenizers::Tekken>();
not sure I follow what this is doing
The HF tokenizer can "load" the Tekken tokenizer file since it's also JSON, which we don't want. The pattern I see here is that we keep trying different tokenizers until one loads, so Tekken has to be tried first.
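To illustrate why the ordering matters, here is a minimal sketch of a "try each loader until one succeeds" chain. Everything in it is hypothetical: the `Tokenizer` interface, the two stand-in classes, and `load_first_matching` are not the ExecuTorch or tokenizers APIs, and the stand-ins decide based on the file name only, whereas real loaders parse the file contents.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal tokenizer interface (not the real tokenizers API).
struct Tokenizer {
  virtual ~Tokenizer() = default;
  virtual bool load(const std::string& path) = 0;
  virtual std::string name() const = 0;
};

static bool ends_with(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
      s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Stand-in for a strict loader: accepts only the Tekken file.
struct StandInTekken : Tokenizer {
  bool load(const std::string& path) override { return ends_with(path, "tekken.json"); }
  std::string name() const override { return "tekken"; }
};

// Stand-in for a permissive loader: accepts any JSON file, including tekken.json.
struct StandInHfJson : Tokenizer {
  bool load(const std::string& path) override { return ends_with(path, ".json"); }
  std::string name() const override { return "hf_json"; }
};

using Factory = std::function<std::unique_ptr<Tokenizer>()>;

inline std::unique_ptr<Tokenizer> make_tekken() {
  return std::unique_ptr<Tokenizer>(new StandInTekken());
}
inline std::unique_ptr<Tokenizer> make_hf_json() {
  return std::unique_ptr<Tokenizer>(new StandInHfJson());
}

// Try candidates in order and return the first one whose load() succeeds.
// Returns nullptr when no candidate accepts the file.
inline std::unique_ptr<Tokenizer> load_first_matching(
    const std::string& path, const std::vector<Factory>& candidates) {
  for (const auto& make : candidates) {
    auto tokenizer = make();
    if (tokenizer->load(path)) {
      return tokenizer;
    }
  }
  return nullptr;
}
```

With `{make_tekken, make_hf_json}` the Tekken file resolves to the Tekken stand-in. With the order reversed, the permissive JSON loader claims it first, which is the failure mode described above.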
    // Prepare inputs
    std::vector<MultimodalInput> inputs;

    // 1. Add start bos-related text inputs and modality start token.
Is this how we run audio inputs with the multimodal runner?
(Messed up the merge for the original stack; this is a reland. Original PR with comments here - #13663) Differential Revision: [D81498749](https://our.internmc.facebook.com/intern/diff/D81498749)
Pull Request resolved: #13663