This pull request adds support for loading and processing `.wav` audio
files in the multimodal runner, alongside existing `.bin` file support.
It introduces a dedicated WAV loader utility, updates the runner to
dispatch audio file processing based on file type, and adds
comprehensive tests for WAV file parsing and normalization. These
changes improve flexibility and robustness when handling audio inputs.
**WAV file support and audio processing:**
* Added a new utility `wav_loader.h` that provides functions to parse
WAV file headers and load normalized PCM audio data from `.wav` files,
supporting 16-bit and 32-bit PCM formats.
* Updated `multimodal.cpp` to support loading audio from both `.bin` and
`.wav` files, including input validation and error handling for
unsupported formats. The runner now uses the processor for both file
types and enforces processor requirements for `.wav` files.
[[1]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843L138-R149)
[[2]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843L166-R191)
[[3]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843R247-L255)
* Added a new command-line flag `data_path` and passed it to the
multimodal runner to facilitate data file handling.
[[1]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843R38)
[[2]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843R294)
[[3]](diffhunk://#diff-0ac16dbe4eaefa08e21fbda582fe2cd2b482f43aaedfc1bf2f31becf5e7bb843L297-R322)
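To make the WAV handling described above concrete, here is a minimal sketch of header parsing and sample normalization for 16-bit and 32-bit integer PCM. It is not the code in `wav_loader.h`: `WavInfo`, `parse_wav_header`, and `load_normalized` are hypothetical names, and scaling to the range [-1, 1) by dividing by 2^15 or 2^31 is an assumed convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical struct for illustration; not the layout used by wav_loader.h.
struct WavInfo {
  uint16_t audio_format = 0;     // 1 means integer PCM
  uint16_t num_channels = 0;
  uint32_t sample_rate = 0;
  uint16_t bits_per_sample = 0;
  size_t data_offset = 0;        // byte offset of the first sample
  uint32_t data_size = 0;        // size of the sample data in bytes
};

// WAV fields are little-endian; assemble them byte by byte so the
// result does not depend on the host's endianness.
inline uint16_t read_u16(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
inline uint32_t read_u32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Walk the RIFF chunk list until both "fmt " and "data" have been seen.
inline WavInfo parse_wav_header(const std::vector<uint8_t>& buf) {
  if (buf.size() < 12 || std::memcmp(buf.data(), "RIFF", 4) != 0 ||
      std::memcmp(buf.data() + 8, "WAVE", 4) != 0) {
    throw std::runtime_error("not a RIFF/WAVE file");
  }
  WavInfo info;
  bool have_fmt = false;
  size_t pos = 12;
  while (pos + 8 <= buf.size()) {
    const uint8_t* chunk = buf.data() + pos;
    const uint32_t size = read_u32(chunk + 4);
    if (std::memcmp(chunk, "fmt ", 4) == 0) {
      if (size < 16 || pos + 24 > buf.size()) {
        throw std::runtime_error("truncated fmt chunk");
      }
      info.audio_format = read_u16(chunk + 8);
      info.num_channels = read_u16(chunk + 10);
      info.sample_rate = read_u32(chunk + 12);
      info.bits_per_sample = read_u16(chunk + 22);
      have_fmt = true;
    } else if (std::memcmp(chunk, "data", 4) == 0) {
      if (!have_fmt) throw std::runtime_error("data chunk before fmt chunk");
      info.data_offset = pos + 8;
      info.data_size = size;
      return info;
    }
    pos += 8 + static_cast<size_t>(size) + (size & 1u);  // chunks are padded to even length
  }
  throw std::runtime_error("missing data chunk");
}

// Convert integer PCM samples to floats in [-1, 1) (assumed convention).
inline std::vector<float> load_normalized(const std::vector<uint8_t>& buf,
                                          const WavInfo& info) {
  if (info.audio_format != 1) throw std::runtime_error("only integer PCM is handled");
  if (info.bits_per_sample != 16 && info.bits_per_sample != 32) {
    throw std::runtime_error("only 16-bit and 32-bit samples are handled");
  }
  if (info.data_offset > buf.size()) throw std::runtime_error("bad data offset");
  const size_t bytes_per_sample = info.bits_per_sample / 8;
  // Tolerate a data_size that claims more bytes than the buffer holds.
  const size_t avail = std::min<size_t>(info.data_size, buf.size() - info.data_offset);
  const size_t n = avail / bytes_per_sample;
  const uint8_t* p = buf.data() + info.data_offset;
  std::vector<float> out;
  out.reserve(n);
  for (size_t i = 0; i < n; ++i) {
    if (info.bits_per_sample == 16) {
      const auto s = static_cast<int16_t>(read_u16(p + 2 * i));
      out.push_back(s / 32768.0f);
    } else {
      const auto s = static_cast<int32_t>(read_u32(p + 4 * i));
      out.push_back(static_cast<float>(s / 2147483648.0));
    }
  }
  return out;
}
```

A caller would read the whole file into a `std::vector<uint8_t>`, call `parse_wav_header`, and pass the result to `load_normalized`. Multi-channel data comes back interleaved in this sketch; whether the real loader downmixes or rejects it is not stated in the PR description.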
**Testing and build integration:**
* Introduced `test_wav_loader.cpp`, which provides unit tests for WAV
header parsing, sample normalization, error handling, and unsupported
format detection.
* Registered the new utility and tests in build configuration files,
ensuring proper header exports and test coverage.
[[1]](diffhunk://#diff-8a73187dfda9c5479db6911bee649164ff4434d36e8f4eb881cc1f049c4e3271R108)
[[2]](diffhunk://#diff-24b61cfeb7f1fc9a646df385ece0c31ea2ab18b3c7e34fc62117c62538e111ffL22-R22)
[[3]](diffhunk://#diff-c8ef93f128805fc48fe2d7c1dadb9ff5d2f4dc5ee7c00b638fd193d3dfb1f06cR47-R56)
[[4]](diffhunk://#diff-d755455ed59da7a902bb5a5c1e540a1924f63e8f70a9dc78b455f2c569a19db6R17)
examples/models/voxtral/README.md (17 additions, 4 deletions)
@@ -41,8 +41,8 @@ To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's Mu
 The Voxtral runner will do the following things:
 
 - Audio Input:
-  - Option A: Pass the raw audio tensor into exported preprocessor to produce a mel spectrogram tensor.
-  - Option B: If starting directly with an already processed audio input tensor, format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
+  - Option A: Pass raw audio data from a `.wav` file into the exported preprocessor to produce a mel spectrogram tensor.
+  - Option B: If starting directly with an already processed audio input tensor (preprocessed mel spectrogram), format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
 - Feed the formatted inputs to the multimodal modal runner.
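
The two audio input options in this README change correspond to the file-type dispatch mentioned in the PR description. A minimal sketch of such a dispatch follows, assuming it keys on the file extension. `AudioFileKind`, `classify_audio_path`, and `require_processor_for` are hypothetical names, and the case-insensitive match is an assumption rather than documented behavior of `multimodal.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical enum for illustration only.
enum class AudioFileKind { kWav, kBin };

// Classify an audio path by its extension and reject anything else.
inline AudioFileKind classify_audio_path(const std::string& path) {
  const size_t dot = path.find_last_of('.');
  const size_t slash = path.find_last_of("/\\");
  // A dot that appears before the last path separator belongs to a directory name.
  if (dot == std::string::npos ||
      (slash != std::string::npos && dot < slash)) {
    throw std::invalid_argument("audio path has no file extension: " + path);
  }
  std::string ext = path.substr(dot);
  // Assumption: extensions are matched case-insensitively.
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (ext == ".wav") return AudioFileKind::kWav;
  if (ext == ".bin") return AudioFileKind::kBin;
  throw std::invalid_argument("unsupported audio file type: " + ext);
}

// Raw .wav audio must go through a preprocessor (Option A in the README),
// so a missing processor is treated as an error for that file kind.
inline void require_processor_for(AudioFileKind kind, bool has_processor) {
  if (kind == AudioFileKind::kWav && !has_processor) {
    throw std::runtime_error("a preprocessor is required for .wav input");
  }
}
```

Failing early on an unknown extension keeps a mislabeled file from being interpreted as raw tensor bytes.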