We export Voxtral using the Optimum CLI, which will export `model.pte` to the `voxtral` output directory:
```
-optimum-cli export executorch
-  --model "mistralai/Voxtral-Mini-3B-2507"
-  --task "multimodal-text-to-text"
-  --recipe "xnnpack"
-  --use_custom_sdpa
-  --use_custom_kv_cache
-  --qlinear 8da4w
-  --qembedding 4w
-  --output_dir="voxtral
+optimum-cli export executorch \
+  --model "mistralai/Voxtral-Mini-3B-2507" \
+  --task "multimodal-text-to-text" \
+  --recipe "xnnpack" \
+  --use_custom_sdpa \
+  --use_custom_kv_cache \
+  --qlinear 8da4w \
+  --qembedding 4w \
+  --output_dir="voxtral"
```
This exports Voxtral with XNNPack backend acceleration and 4-bit weight/8-bit activation linear quantization.
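
As a rough illustration of what 4-bit weight/8-bit activation quantization means for a linear layer, the sketch below quantizes a weight vector to 4 bits and an activation vector to 8 bits, then computes their dot product with integer arithmetic. It is a pure-Python toy under simplifying assumptions (one symmetric scale per vector, hypothetical helper names `quantize_symmetric` and `quantized_dot`). It is not the scheme that the `8da4w` option or the XNNPack backend actually implements, which may use per-group scales and zero points.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed `bits`-bit integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit, 127 for 8-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit, -128 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(weights, activations):
    """Approximate dot(weights, activations) with 4-bit weights and 8-bit activations."""
    q_w, scale_w = quantize_symmetric(weights, bits=4)
    # The activation scale is computed from the input at call time.
    q_a, scale_a = quantize_symmetric(activations, bits=8)
    acc = sum(w * a for w, a in zip(q_w, q_a))  # integer accumulation
    return acc * scale_w * scale_a              # rescale back to float


# Illustrative values only; they are not taken from the model.
weights = [0.7, -0.35, 0.1, 0.0]
activations = [1.0, 2.0, -1.0, 0.5]
exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this toy, the 4-bit weight rounding dominates the error, which is one reason practical schemes tend to use finer-grained scales than a single scale per vector.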

-# [Optional] Exporting the audio preprocessor
+# Running the model
+To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
+The Voxtral runner will do the following things:
+
+- Audio Input:
+  - Option A: Pass the raw audio tensor into the exported preprocessor to produce a mel spectrogram tensor.
+  - Option B: If starting directly with an already processed audio input tensor, format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
+- Feed the formatted inputs to the multimodal runner.
+
+
+# [Option A] Exporting the audio preprocessor
The exported model takes in a mel spectrogram input tensor as its audio inputs.
We provide a simple way to transform raw audio data into a mel spectrogram by exporting a version of Voxtral's audio preprocessor used directly by Transformers.
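
To give a sense of what such a preprocessor computes, the sketch below shows two small pieces of a mel spectrogram pipeline: the Hz-to-mel mapping used to place filters, and the number of frames produced for a clip. The formula is the common HTK-style mel scale. The 16 kHz sample rate and 160-sample hop are assumed values chosen for illustration, not parameters confirmed by this README, and all function names are hypothetical. The exported preprocessor may use a different mel variant and framing.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale: roughly linear at low Hz, logarithmic at high Hz."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


def num_frames(num_samples, hop_length):
    """Frame count when one frame is emitted every hop_length samples."""
    return num_samples // hop_length


SAMPLE_RATE = 16_000  # assumption, not taken from this README
HOP_LENGTH = 160      # assumption, not taken from this README

centers = mel_center_frequencies(n_mels=8, f_min=0.0, f_max=SAMPLE_RATE / 2)
print([round(c) for c in centers])
print(num_frames(30 * SAMPLE_RATE, HOP_LENGTH))
```

The filter centers come out densely packed at low frequencies and sparse at high ones, which is the property that makes the mel scale a useful front end for speech models.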
-To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
-The Voxtral runner will do the following things:
-1. [Optional] Pass the raw audio tensor into exported preprocessor to produce a mel spectrogram tensor.
-2. [If starting directly with an already processed audio input tensor] Format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
-3. Feed the formatted inputs to the multimodal modal runner.