Commit 51b3329

Fix voxtral instructions (#14026)
1 parent f01198f commit 51b3329

2 files changed: +36 -22 lines changed


examples/models/voxtral/CMakeLists.txt

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,10 @@ else()
   set(CMAKE_TOOLCHAIN_IOS OFF)
 endif()
 
+if(NOT CMAKE_CXX_STANDARD)
+  set(CMAKE_CXX_STANDARD 17)
+endif()
+
 # Let files say "include <executorch/path/to/header.h>"
 set(_common_include_directories ${EXECUTORCH_ROOT}/..)

examples/models/voxtral/README.md

Lines changed: 32 additions & 22 deletions
@@ -26,34 +26,37 @@ pip install git+https://github.com/huggingface/transformers@6121e9e46c4fc4e5c91d
 ## Using the export CLI
 We export Voxtral using the Optimum CLI, which will export `model.pte` to the `voxtral` output directory:
 ```
-optimum-cli export executorch
---model "mistralai/Voxtral-Mini-3B-2507"
---task "multimodal-text-to-text"
---recipe "xnnpack"
---use_custom_sdpa
---use_custom_kv_cache
---qlinear 8da4w
---qembedding 4w
---output_dir="voxtral
+optimum-cli export executorch \
+--model "mistralai/Voxtral-Mini-3B-2507" \
+--task "multimodal-text-to-text" \
+--recipe "xnnpack" \
+--use_custom_sdpa \
+--use_custom_kv_cache \
+--qlinear 8da4w \
+--qembedding 4w \
+--output_dir="voxtral"
 ```
 
 This exports Voxtral with XNNPack backend acceleration and 4-bit weight/8-bit activation linear quantization.
 
-# [Optional] Exporting the audio preprocessor
+# Running the model
+To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
+The Voxtral runner will do the following things:
+
+- Audio Input:
+  - Option A: Pass the raw audio tensor into exported preprocessor to produce a mel spectrogram tensor.
+  - Option B: If starting directly with an already processed audio input tensor, format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
+- Feed the formatted inputs to the multimodal modal runner.
+
+
+# [Option A] Exporting the audio preprocessor
 The exported model takes in a mel spectrogram input tensor as its audio inputs.
 We provide a simple way to transform raw audio data into a mel spectrogram by exporting a version of Voxtral's audio preprocessor used directly by Transformers.
 
 ```
 python -m executorch.extension.audio.mel_spectrogram --feature_size 128 --output_file voxtral_preprocessor.pte
 ```
 
-# Running the model
-To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
-The Voxtral runner will do the following things:
-1. [Optional] Pass the raw audio tensor into exported preprocessor to produce a mel spectrogram tensor.
-2. [If starting directly with an already processed audio input tensor] Format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
-3. Feed the formatted inputs to the multimodal modal runner.
-
 ## Building the multimodal runner
 ```
 # Build and install ExecuTorch
@@ -66,11 +69,12 @@ cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Re
 ## Running the model
 You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).
 ```
-./cmake-out/examples/models/voxtral/voxtral_runner
---model_path voxtral/model.pte
---tokenizer_path path/to/tekken.json
---prompt "What can you tell me about this audio?"
---audio_path ~/models/voxtral/audio_input.bin
+./cmake-out/examples/models/voxtral/voxtral_runner \
+--model_path path/to/model.pte \
+--tokenizer_path path/to/tekken.json \
+--prompt "What can you tell me about this audio?" \
+--audio_path path/to/audio_input.bin \
+--processor_path path/to/voxtral_preprocessor.pte # If you're passing raw audio file in audio_path
 ```
 
 Example output:
@@ -93,3 +97,9 @@ You can easily produce an `.bin` for the audio input in Python like this:
 with open("tensor.bin", "wb") as f:
     f.write(t.numpy().tobytes())
 ```
+
+You can also produce raw audio file as follows (for Option A):
+
+```
+ffmpeg -i audio.mp3 -f f32le -acodec pcm_f32le audio_input.bin
+```
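
Reviewer note, not part of this commit: the `ffmpeg` command added above writes a headerless stream of little-endian 32-bit floats. The sketch below is one way to sanity-check such a file from Python using only the standard library. The helper names `write_f32le` and `read_f32le` are hypothetical and do not exist in ExecuTorch. The sketch makes no assumption about the sample rate or channel layout the runner expects.

```python
import os
import struct
import tempfile


def write_f32le(samples, path):
    """Write samples as headerless little-endian float32 bytes (hypothetical helper)."""
    samples = list(samples)
    with open(path, "wb") as f:
        f.write(struct.pack("<%df" % len(samples), *samples))


def read_f32le(path):
    """Read a headerless little-endian float32 file into a list of floats (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    # Each float32 sample occupies exactly 4 bytes.
    if len(raw) % 4 != 0:
        raise ValueError("file size is not a multiple of 4 bytes")
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


if __name__ == "__main__":
    # Round-trip a few samples through a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "audio_input.bin")
        write_f32le([0.0, 0.5, -1.0], path)
        print(os.path.getsize(path), read_f32le(path))
```

A file whose size is not a multiple of 4 bytes cannot be a valid float32 stream, so `read_f32le` raises on it rather than silently truncating.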
