
Commit 8711ebd

manuelcandales authored and GregoryComer committed

Add Metal backend documentation to Voxtral README (pytorch#15273)

This PR updates the Voxtral README to document Metal backend support on Apple Silicon.

1 parent 146c8cb commit 8711ebd

1 file changed: +133 −4 lines changed

examples/models/voxtral/README.md

Lines changed: 133 additions & 4 deletions

@@ -36,6 +36,64 @@ optimum-cli export executorch \
This exports Voxtral with XNNPACK backend acceleration and 4-bit weight/8-bit activation linear quantization.

## CUDA Support
If your environment has CUDA support, you can enable the runner to run on CUDA for improved performance. Follow the export and runtime commands below:

### Exporting with CUDA
```
optimum-cli export executorch \
--model "mistralai/Voxtral-Mini-3B-2507" \
--task "multimodal-text-to-text" \
--recipe "cuda" \
--dtype bfloat16 \
--device cuda \
--max_seq_len 1024 \
--output_dir="voxtral"
```

This will generate:
- `model.pte` - The exported model
- `aoti_cuda_blob.ptd` - The CUDA kernel blob required for runtime
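To sanity-check the export, list the output directory (the name matches `--output_dir` above); both files should be present:
```
ls voxtral/
# model.pte  aoti_cuda_blob.ptd  (other export artifacts may also appear)
```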
Furthermore, we support several quantization formats on CUDA.
For example, to export Voxtral with int4 weights and the int4mm kernel for the linear layers, you can use the following command:
```
optimum-cli export executorch \
--model "mistralai/Voxtral-Mini-3B-2507" \
--task "multimodal-text-to-text" \
--recipe "cuda" \
--dtype bfloat16 \
--device cuda \
--max_seq_len 1024 \
--qlinear 4w \
--qlinear_encoder 4w \
--qlinear_packing_format tile_packed_to_4d \
--qlinear_encoder_packing_format tile_packed_to_4d \
--output_dir="voxtral"
```

See the "Building the multimodal runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.
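Reading the flag names: `--qlinear 4w` and `--qlinear_encoder 4w` request 4-bit weight quantization for the decoder and encoder linear layers respectively, and the `tile_packed_to_4d` packing formats lay those int4 weights out for the int4mm kernel (see the optimum-executorch documentation for the authoritative description of each flag).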
## Metal Support
On Apple Silicon, you can enable the runner to run on Metal. Follow the export and runtime commands below:

### Exporting with Metal
```
optimum-cli export executorch \
--model "mistralai/Voxtral-Mini-3B-2507" \
--task "multimodal-text-to-text" \
--recipe "metal" \
--dtype bfloat16 \
--max_seq_len 1024 \
--output_dir="voxtral"
```

This will generate:
- `model.pte` - The exported model
- `aoti_metal_blob.ptd` - The Metal kernel blob required for runtime

See the "Building the multimodal runner" section below for instructions on building with Metal support, and the "Running the model" section for runtime instructions.
# Running the model
To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
The Voxtral runner will do the following things:
@@ -52,7 +110,12 @@ We provide a simple way to transform raw audio data into a mel spectrogram by ex
```
# Export a preprocessor that can handle audio up to 5 mins (300s).
python -m executorch.extension.audio.mel_spectrogram \
--feature_size 128 \
--stack_output \
--max_audio_len 300 \
--output_file voxtral_preprocessor.pte
```
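For intuition, here is a rough, conceptually similar sketch of what a 128-bin mel spectrogram front end computes, using `torchaudio`. This is not the exported module; the sample rate, `n_fft`, and `hop_length` below are illustrative assumptions, and only `n_mels=128` comes from the `--feature_size 128` flag above:
```
# Illustrative only: a generic log-mel front end, not the exact exported preprocessor.
import torch
import torchaudio

sample_rate = 16000  # assumption: matches the 16 kHz rate used by the ffmpeg command below
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # assumed window size
    hop_length=160,   # assumed hop (10 ms at 16 kHz)
    n_mels=128,       # from --feature_size 128
)

waveform = torch.randn(1, sample_rate * 5)   # 5 seconds of dummy audio
features = torch.log(mel(waveform) + 1e-6)   # log-mel features: [1, 128, num_frames]
print(features.shape)
```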
## Building the multimodal runner
@@ -64,18 +127,73 @@ cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out -
cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -Bcmake-out/examples/models/voxtral examples/models/voxtral && cmake --build cmake-out/examples/models/voxtral -j16 --config Release
```
### Building for CUDA
```
# Install ExecuTorch with CUDA support
CMAKE_ARGS="-DEXECUTORCH_BUILD_CUDA=ON" ./install_executorch.sh

# Build the multimodal runner with CUDA
cmake --preset llm \
-DEXECUTORCH_BUILD_CUDA=ON \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-Bcmake-out -S.
cmake --build cmake-out -j16 --target install --config Release

cmake -DEXECUTORCH_BUILD_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-Sexamples/models/voxtral \
-Bcmake-out/examples/models/voxtral/
cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
```
### Building for Metal
```
# Install ExecuTorch with Metal support
CMAKE_ARGS="-DEXECUTORCH_BUILD_METAL=ON" ./install_executorch.sh

# Build the multimodal runner with Metal
cmake --preset llm \
-DEXECUTORCH_BUILD_METAL=ON \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-Bcmake-out -S.
cmake --build cmake-out -j16 --target install --config Release

cmake -DEXECUTORCH_BUILD_METAL=ON \
-DCMAKE_BUILD_TYPE=Release \
-Sexamples/models/voxtral \
-Bcmake-out/examples/models/voxtral/
cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
```
## Running the model
You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).
```
./cmake-out/examples/models/voxtral/voxtral_runner \
--model_path path/to/model.pte \
--tokenizer_path path/to/tekken.json \
--prompt "What can you tell me about this audio?" \
--audio_path path/to/audio_input.wav \
--processor_path path/to/voxtral_preprocessor.pte
```
### Running with preprocessed audio (.bin file)
If you already have a preprocessed mel spectrogram saved as a `.bin` file, you can skip the preprocessor:
```
./cmake-out/examples/models/voxtral/voxtral_runner \
--model_path path/to/model.pte \
--tokenizer_path path/to/tekken.json \
--prompt "What can you tell me about this audio?" \
--audio_path path/to/preprocessed_audio.bin
```
### Running on CUDA or Metal
Add the `--data_path` argument to the commands above to provide the appropriate data blob, as in the example after this list:
- For CUDA: `--data_path path/to/aoti_cuda_blob.ptd`
- For Metal: `--data_path path/to/aoti_metal_blob.ptd`
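For example, a full CUDA invocation looks like this (all paths are placeholders):
```
./cmake-out/examples/models/voxtral/voxtral_runner \
--model_path path/to/model.pte \
--tokenizer_path path/to/tekken.json \
--prompt "What can you tell me about this audio?" \
--audio_path path/to/audio_input.wav \
--processor_path path/to/voxtral_preprocessor.pte \
--data_path path/to/aoti_cuda_blob.ptd
```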
# Example output:
```
The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that
the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might
@@ -89,6 +207,7 @@ I 00:00:24.036822 executorch:stats.h:147] Time to first generated token:
I 00:00:24.036828 executorch:stats.h:153] Sampling time over 487 tokens: 0.099000 (seconds)
```
# Generating audio input
You can easily produce a `.bin` file for the audio input in Python like this:
```
# t = some torch.Tensor holding the audio samples
# a minimal sketch (assuming t is a float32 1-D tensor): dump the raw bytes
t.to(torch.float32).numpy().tofile("audio_input.bin")
```
@@ -101,3 +220,13 @@ You can also produce raw audio file as follows (for Option A):
```
ffmpeg -i audio.mp3 -f f32le -acodec pcm_f32le -ar 16000 audio_input.bin
```
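Equivalently, if you already have a `.wav` file (e.g., from the `say` recipe below), a small Python sketch using `torchaudio` can produce the same raw float32 layout; the 16 kHz target rate mirrors the ffmpeg command above:
```
# Sketch: convert a .wav file to the raw float32 .bin layout the ffmpeg command produces.
import torch
import torchaudio

waveform, sr = torchaudio.load("call_samantha_hall.wav")        # [channels, samples]
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # resample to 16 kHz
waveform.mean(dim=0).to(torch.float32).numpy().tofile("audio_input.bin")  # mono, f32le
```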
### Generating a .wav file on Mac
On macOS, you can use the built-in `say` command to generate speech audio and convert it to a `.wav` file:
```
# Generate audio using text-to-speech
say -o call_samantha_hall.aiff "Call Samantha Hall"

# Convert to .wav format
afconvert -f WAVE -d LEI16 call_samantha_hall.aiff call_samantha_hall.wav
```
