
Commit 9fad8f5

Update README.md
1 parent e15c046 commit 9fad8f5

1 file changed: +15 -1 lines changed

examples/models/voxtral/README.md

Lines changed: 15 additions & 1 deletion
@@ -46,7 +46,7 @@ python -m executorch.extension.audio.mel_spectrogram --feature_size 128 --output
 To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
 The Voxtral runner will do the following things:
 1. [Optional] Pass the raw audio tensor into the exported preprocessor to produce a mel spectrogram tensor.
-2. [If starting directly with an already processed audio input tensor, starts here] Formats the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
+2. [If starting directly with an already processed audio input tensor] Format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
 3. Feed the formatted inputs to the multimodal runner.
 
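The three-step flow above can be sketched in plain Python. This is purely illustrative: every function and token name below is a hypothetical stand-in, not the actual ExecuTorch MultiModal runner API.

```python
# Illustrative sketch of the Voxtral runner flow described above.
# All functions and token names are hypothetical stand-ins, NOT the ExecuTorch API.

def preprocess_audio(raw_audio):
    # Step 1 (optional): raw audio -> mel spectrogram stand-in.
    # In the real flow, the exported preprocessor produces a mel spectrogram tensor.
    return [[abs(x) for x in raw_audio]]  # placeholder "spectrogram"

def format_inputs(metadata_tokens, audio_tokens, text_tokens):
    # Step 2: combine metadata, audio, and text tokens into one input sequence.
    return metadata_tokens + audio_tokens + text_tokens

def run_multimodal(formatted):
    # Step 3: feed the formatted inputs to the multimodal runner (stubbed here).
    return f"<generated from {len(formatted)} input tokens>"

spectrogram = preprocess_audio([0.1, -0.2, 0.3])
inputs = format_inputs(["<begin_audio>"],
                       ["<audio_emb>"] * len(spectrogram[0]),
                       ["Describe", "this"])
print(run_multimodal(inputs))  # 1 metadata + 3 audio + 2 text tokens
```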
## Building the multimodal runner
@@ -68,6 +68,20 @@ You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](h
    --audio_path ~/models/voxtral/audio_input.bin
 ```
 
+Example output:
+```
+The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that
+the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might
+PyTorchObserver {"prompt_tokens":388,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1756351346381,"inference_end_ms":1756351362602,"prompt_eval_end_ms":1756351351435,"first_token_ms":1756351351435,"aggregate_sampling_time_ms":99,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
+I 00:00:24.036773 executorch:stats.h:104] Prompt Tokens: 388 Generated Tokens: 99
+I 00:00:24.036800 executorch:stats.h:110] Model Load Time: 0.000000 (seconds)
+I 00:00:24.036805 executorch:stats.h:117] Total inference time: 16.221000 (seconds) Rate: 6.103200 (tokens/second)
+I 00:00:24.036815 executorch:stats.h:127] Prompt evaluation: 5.054000 (seconds) Rate: 76.770875 (tokens/second)
+I 00:00:24.036819 executorch:stats.h:136] Generated 99 tokens: 11.167000 (seconds) Rate: 8.865407 (tokens/second)
+I 00:00:24.036822 executorch:stats.h:147] Time to first generated token: 5.054000 (seconds)
+I 00:00:24.036828 executorch:stats.h:153] Sampling time over 487 tokens: 0.099000 (seconds)
+```
+
 You can easily produce a `.bin` for the audio input in Python like this:
 ```
 # t = some torch.Tensor
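The snippet above is truncated in this diff. A plausible completion, with a NumPy array standing in for the torch tensor (with a real `torch.Tensor` you would convert via `t.detach().numpy()` first); the raw little-endian float32 byte layout is an assumption, not something this README confirms.

```python
import numpy as np

# t = some torch.Tensor; a NumPy float32 array stands in for it here.
t = np.arange(8, dtype=np.float32)

# Assumption: the runner reads the tensor's raw float32 bytes from the .bin file.
with open("audio_input.bin", "wb") as f:
    f.write(t.tobytes())

# Read it back to confirm the round-trip.
restored = np.frombuffer(open("audio_input.bin", "rb").read(), dtype=np.float32)
assert (restored == t).all()
```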

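The throughput figures in the example log follow directly from the `PyTorchObserver` counters. A quick sanity check, using the numbers reported in this run:

```python
import json

# Counters copied from the PyTorchObserver line in the example output above.
observer = json.loads(
    '{"prompt_tokens":388,"generated_tokens":99,'
    '"inference_start_ms":1756351346381,"inference_end_ms":1756351362602,'
    '"prompt_eval_end_ms":1756351351435}'
)

total_s = (observer["inference_end_ms"] - observer["inference_start_ms"]) / 1000
prompt_s = (observer["prompt_eval_end_ms"] - observer["inference_start_ms"]) / 1000
gen_s = total_s - prompt_s  # generation time = total minus prompt evaluation

# Matches the stats.h lines: 16.221 s total, 5.054 s prompt eval, 11.167 s generation.
print(f"total: {total_s:.3f}s, rate: {observer['generated_tokens'] / total_s:.4f} tok/s")
print(f"prompt eval: {prompt_s:.3f}s, rate: {observer['prompt_tokens'] / prompt_s:.4f} tok/s")
print(f"generation: {gen_s:.3f}s, rate: {observer['generated_tokens'] / gen_s:.4f} tok/s")
```

Note that the headline "Total inference time ... Rate" in the log divides only the 99 generated tokens by the 16.221 s total (6.1032 tok/s); prompt evaluation has its own rate over the 388 prompt tokens.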
0 commit comments