
Commit 66c3dea

Add a wav loader (pytorch#14923)
This pull request adds support for loading and processing `.wav` audio files in the multimodal runner, alongside existing `.bin` file support. It introduces a dedicated WAV loader utility, updates the runner to dispatch audio file processing based on file type, and adds comprehensive tests for WAV file parsing and normalization. These changes improve flexibility and robustness when handling audio inputs.

**WAV file support and audio processing:**

* Added a new utility `wav_loader.h` that provides functions to parse WAV file headers and load normalized PCM audio data from `.wav` files, supporting 16-bit and 32-bit PCM formats.
* Updated `multimodal.cpp` to support loading audio from both `.bin` and `.wav` files, including input validation and error handling for unsupported formats. The runner now uses the processor for both file types and enforces processor requirements for `.wav` files.
* Added a new command-line flag `data_path` and passed it to the multimodal runner to facilitate data file handling.

**Testing and build integration:**

* Introduced `test_wav_loader.cpp`, which provides unit tests for WAV header parsing, sample normalization, error handling, and unsupported format detection.
* Registered the new utility and tests in build configuration files, ensuring proper header exports and test coverage.
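To make the WAV handling described in the commit message concrete, here is a minimal sketch of header parsing and sample normalization over an in-memory byte buffer. It is not the code in `wav_loader.h`: the names `WavInfo`, `parse_wav_header`, and `decode_pcm` are hypothetical, and the scaling constants (32768 for 16-bit, 2147483648 for 32-bit) are an assumption about what "normalized" means here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical types and names for illustration; not the wav_loader.h API.
struct WavInfo {
  std::uint16_t audio_format = 0;     // 1 means integer PCM
  std::uint16_t num_channels = 0;
  std::uint32_t sample_rate = 0;
  std::uint16_t bits_per_sample = 0;
  std::size_t data_offset = 0;        // byte offset of the first sample
  std::size_t data_size = 0;          // size of the sample data in bytes
};

// Read little-endian integers byte by byte so host byte order is irrelevant.
inline std::uint16_t read_u16(const std::vector<std::uint8_t>& b, std::size_t off) {
  if (off + 2 > b.size()) throw std::runtime_error("truncated WAV data");
  return static_cast<std::uint16_t>(b[off] | (b[off + 1] << 8));
}

inline std::uint32_t read_u32(const std::vector<std::uint8_t>& b, std::size_t off) {
  if (off + 4 > b.size()) throw std::runtime_error("truncated WAV data");
  return static_cast<std::uint32_t>(b[off]) |
      (static_cast<std::uint32_t>(b[off + 1]) << 8) |
      (static_cast<std::uint32_t>(b[off + 2]) << 16) |
      (static_cast<std::uint32_t>(b[off + 3]) << 24);
}

inline bool tag_is(const std::vector<std::uint8_t>& b, std::size_t off, const char* tag) {
  if (off + 4 > b.size()) return false;
  for (std::size_t i = 0; i < 4; ++i) {
    if (b[off + i] != static_cast<std::uint8_t>(tag[i])) return false;
  }
  return true;
}

// Walk the RIFF chunks, recording the "fmt " fields and the "data" extent.
inline WavInfo parse_wav_header(const std::vector<std::uint8_t>& b) {
  if (!tag_is(b, 0, "RIFF") || !tag_is(b, 8, "WAVE")) {
    throw std::runtime_error("not a RIFF/WAVE file");
  }
  WavInfo info;
  bool have_fmt = false;
  std::size_t off = 12;
  while (off + 8 <= b.size()) {
    const std::size_t size = read_u32(b, off + 4);
    const std::size_t body = off + 8;
    if (tag_is(b, off, "fmt ")) {
      info.audio_format = read_u16(b, body);
      info.num_channels = read_u16(b, body + 2);
      info.sample_rate = read_u32(b, body + 4);
      info.bits_per_sample = read_u16(b, body + 14);
      have_fmt = true;
    } else if (tag_is(b, off, "data")) {
      if (!have_fmt || size > b.size() - body) {
        throw std::runtime_error("malformed WAV file");
      }
      info.data_offset = body;
      info.data_size = size;
      return info;
    }
    off = body + size + (size & 1);  // chunks are padded to an even size
  }
  throw std::runtime_error("WAV file has no data chunk");
}

// Convert 16-bit or 32-bit integer PCM to floats in [-1, 1).
// Assumption: scaling by 2^15 / 2^31; other formats are rejected.
inline std::vector<float> decode_pcm(const std::vector<std::uint8_t>& b) {
  const WavInfo info = parse_wav_header(b);
  if (info.audio_format != 1 ||
      (info.bits_per_sample != 16 && info.bits_per_sample != 32)) {
    throw std::runtime_error("only 16-bit and 32-bit PCM are handled here");
  }
  const std::size_t bytes_per_sample = info.bits_per_sample / 8;
  assert(bytes_per_sample == 2 || bytes_per_sample == 4);
  std::vector<float> out;
  out.reserve(info.data_size / bytes_per_sample);
  for (std::size_t i = 0; i + bytes_per_sample <= info.data_size;
       i += bytes_per_sample) {
    const std::size_t off = info.data_offset + i;
    if (bytes_per_sample == 2) {
      const auto s = static_cast<std::int16_t>(read_u16(b, off));
      out.push_back(static_cast<float>(s) / 32768.0f);
    } else {
      const auto s = static_cast<std::int32_t>(read_u32(b, off));
      out.push_back(static_cast<float>(s) / 2147483648.0f);
    }
  }
  return out;
}
```

The sketch walks chunks rather than assuming a fixed 44-byte header, because WAV files may carry extra chunks (for example `LIST`) before `data`. Multi-channel files are returned interleaved; whether the real loader downmixes them is not stated in this commit.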
1 parent d0827e5 commit 66c3dea

8 files changed: +460 -45 lines changed
examples/models/voxtral/README.md

Lines changed: 17 additions & 4 deletions
@@ -41,8 +41,8 @@ To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's Mu
 The Voxtral runner will do the following things:

 - Audio Input:
-  - Option A: Pass the raw audio tensor into exported preprocessor to produce a mel spectrogram tensor.
-  - Option B: If starting directly with an already processed audio input tensor, format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
+  - Option A: Pass raw audio data from a `.wav` file into the exported preprocessor to produce a mel spectrogram tensor.
+  - Option B: If starting directly with an already processed audio input tensor (preprocessed mel spectrogram), format the inputs to the multimodal runner (metadata tokens, audio tokens, text tokens, etc.).
 - Feed the formatted inputs to the multimodal modal runner.

@@ -66,13 +66,26 @@ cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Re

 ## Running the model
 You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).
+
+### Running with raw audio (.wav file)
+For raw audio files (`.wav`), you must provide a preprocessor to convert the audio into mel spectrogram format:
+```
+./cmake-out/examples/models/voxtral/voxtral_runner \
+  --model_path path/to/model.pte \
+  --tokenizer_path path/to/tekken.json \
+  --prompt "What can you tell me about this audio?" \
+  --audio_path path/to/audio_input.wav \
+  --processor_path path/to/voxtral_preprocessor.pte
+```
+
+### Running with preprocessed audio (.bin file)
+If you already have a preprocessed mel spectrogram saved as a `.bin` file, you can skip the preprocessor:
 ```
 ./cmake-out/examples/models/voxtral/voxtral_runner \
   --model_path path/to/model.pte \
   --tokenizer_path path/to/tekken.json \
   --prompt "What can you tell me about this audio?" \
-  --audio_path path/to/audio_input.bin \
-  --processor_path path/to/voxtral_preprocessor.pte # If you're passing raw audio file in audio_path
+  --audio_path path/to/preprocessed_audio.bin
 ```

 Example output:

examples/models/voxtral/multimodal.cpp

Lines changed: 65 additions & 40 deletions
@@ -21,6 +21,7 @@
 #include <executorch/extension/llm/runner/llm_runner_helper.h>
 #include <executorch/extension/llm/runner/multimodal_input.h>
 #include <executorch/extension/llm/runner/multimodal_runner.h>
+#include <executorch/extension/llm/runner/wav_loader.h>
 #include <executorch/runtime/core/error.h>
 #include <executorch/runtime/platform/log.h>

@@ -34,6 +35,7 @@ DEFINE_string(
     "multimodal.pte",
     "Model serialized in flatbuffer format.");

+DEFINE_string(data_path, "", "Path to data file.");
 DEFINE_string(tokenizer_path, "tekken.json", "Tokenizer stuff.");

 DEFINE_string(prompt, "What is happening in this audio?", "Text prompt.");
@@ -113,15 +115,15 @@ MultimodalInput loadPreprocessedAudio(const std::string& audio_path) {
 }

 /**
- * @brief Loads a .bin file into a tensor and processes it using a .pte
- * processor
+ * @brief Loads raw audio from a .bin or .wav file and processes it using a
+ * .pte processor
  *
- * This function loads raw audio data from a .bin file (similar to
- * loadPreprocessedAudio), creates a tensor from it, and then passes it through
- * a processor module loaded from a .pte file to generate processed audio
- * features.
+ * This function loads raw audio data from either a .bin file (raw float array)
+ * or a .wav file (WAV format with headers), creates a tensor from it, and then
+ * passes it through a processor module loaded from a .pte file to generate
+ * processed audio features.
  *
- * @param audio_path Path to the .bin audio file
+ * @param audio_path Path to the .bin or .wav audio file
  * @param processor_path Path to the .pte processor file
  * @return MultimodalInput containing the processed audio data
  * @throws std::runtime_error if file loading or processing fails
@@ -135,6 +137,41 @@ MultimodalInput processRawAudioFile(
         "Processor path is required for raw audio processing");
   }

+  // Load the audio data from file (.bin or .wav)
+  std::vector<float> audio_data;
+  if (ends_with(audio_path, ".wav")) {
+    audio_data = ::executorch::extension::llm::load_wav_audio_data(audio_path);
+    ET_LOG(
+        Info,
+        "Loaded WAV file: %s, %zu samples",
+        audio_path.c_str(),
+        audio_data.size());
+  } else if (ends_with(audio_path, ".bin")) {
+    std::ifstream f(audio_path, std::ios::binary | std::ios::ate);
+    if (!f.is_open()) {
+      ET_LOG(Error, "Failed to open audio file: %s", audio_path.c_str());
+      throw std::runtime_error("Failed to open audio file");
+    }
+
+    std::size_t n_floats = f.tellg() / sizeof(float);
+    f.seekg(0, std::ios::beg);
+
+    audio_data.resize(n_floats);
+    f.read(
+        reinterpret_cast<char*>(audio_data.data()),
+        audio_data.size() * sizeof(float));
+    f.close();
+
+    ET_LOG(
+        Info, "Loaded .bin file: %s, %zu floats", audio_path.c_str(), n_floats);
+  } else {
+    ET_LOG(
+        Error,
+        "Unsupported audio file format: %s (only .bin and .wav files are supported)",
+        audio_path.c_str());
+    throw std::runtime_error("Unsupported audio file format");
+  }
+
   // Load the audio processor .pte.
   std::unique_ptr<Module> processor_module;
   try {
@@ -153,25 +190,6 @@ MultimodalInput processRawAudioFile(
     throw std::runtime_error("Exception while loading processor module");
   }

-  // Load the audio data from file.
-  std::ifstream f(audio_path, std::ios::binary | std::ios::ate);
-  if (!f.is_open()) {
-    ET_LOG(Error, "Failed to open audio file: %s", audio_path.c_str());
-    throw std::runtime_error("Failed to open audio file");
-  }
-
-  std::size_t n_floats = f.tellg() / sizeof(float);
-  f.seekg(0, std::ios::beg);
-
-  std::vector<float> audio_data(n_floats);
-  f.read(
-      reinterpret_cast<char*>(audio_data.data()),
-      audio_data.size() * sizeof(float));
-  f.close();
-
-  ET_LOG(
-      Info, "Loaded .bin file: %s, %zu floats", audio_path.c_str(), n_floats);
-
   // Execute the processor
   std::vector<executorch::aten::SizesType> tensor_shape = {
       static_cast<executorch::aten::SizesType>(audio_data.size())};
@@ -226,33 +244,39 @@ MultimodalInput processRawAudioFile(
  *
  * Dispatches audio file processing based on file extension and processor
  * availability:
+ * - .wav files: Requires processor, processes raw audio through processor
  * - .bin files with processor: Loads raw audio from .bin and processes through
  * processor
  * - .bin files without processor: Loads preprocessed mel spectrogram features
  * directly
  *
- * @param audio_path Path to the audio file (.bin)
- * @param processor_path Path to the processor .pte file (optional)
+ * @param audio_path Path to the audio file (.bin or .wav)
+ * @param processor_path Path to the processor .pte file (optional for .bin,
+ * required for .wav)
  * @return MultimodalInput containing the processed audio data
  * @throws std::runtime_error if file format is unsupported or processing fails
  */
 MultimodalInput processAudioFile(
     const std::string& audio_path,
     const std::string& processor_path = "") {
-  if (ends_with(audio_path, ".bin")) {
-    if (!processor_path.empty()) {
-      // Process raw audio from .bin file through the processor
-      return processRawAudioFile(audio_path, processor_path);
-    } else {
-      // Load preprocessed audio stored as a binary file (existing behavior)
-      return loadPreprocessedAudio(audio_path);
+  if (ends_with(audio_path, ".wav") || ends_with(audio_path, ".bin")) {
+    if (processor_path.empty()) {
+      if (ends_with(audio_path, ".wav")) {
+        ET_CHECK_MSG(
+            false,
+            "Processor path is required for .wav file processing: %s",
+            audio_path.c_str());
+      } else {
+        // Load preprocessed audio stored as a binary file (existing behavior)
+        return loadPreprocessedAudio(audio_path);
+      }
     }
+    return processRawAudioFile(audio_path, processor_path);
   } else {
-    ET_LOG(
-        Error,
-        "Unsupported audio file format: %s (only .bin files are supported)",
+    ET_CHECK_MSG(
+        false,
+        "Unsupported audio file format: %s (only .bin and .wav files are supported)",
         audio_path.c_str());
-    throw std::runtime_error("Unsupported audio file format");
   }
 }

@@ -267,6 +291,7 @@ int32_t main(int32_t argc, char** argv) {
   const char* prompt = FLAGS_prompt.c_str();
   const char* audio_path = FLAGS_audio_path.c_str();
   const char* processor_path = FLAGS_processor_path.c_str();
+  const char* data_path = FLAGS_data_path.c_str();
   float temperature = FLAGS_temperature;
   int32_t cpu_threads = FLAGS_cpu_threads;
   bool warmup = FLAGS_warmup;
@@ -294,7 +319,7 @@ int32_t main(int32_t argc, char** argv) {
   // Create multimodal runner
   std::unique_ptr<::executorch::extension::llm::MultimodalRunner> runner =
       ::executorch::extension::llm::create_multimodal_runner(
-          model_path, std::move(tokenizer));
+          model_path, std::move(tokenizer), data_path);
   if (runner == nullptr) {
     ET_LOG(Error, "Failed to create multimodal runner");
     return 1;

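The dispatch rules that `processAudioFile` documents in the diff above (a `.wav` file needs a processor, a `.bin` file goes through the processor only when one is given, anything else is rejected) can be summarized as a small pure function. This is a sketch, not the runner's code: `AudioRoute`, `has_suffix`, and `choose_audio_route` are hypothetical names, and it throws `std::invalid_argument` where the commit uses `ET_CHECK_MSG(false, ...)`, so that the failure cases can be exercised without terminating the process.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical names for illustration; the runner defines its own helpers.
enum class AudioRoute {
  kPreprocessed,     // .bin holding ready-made features, loaded as-is
  kRawViaProcessor,  // raw samples that must pass through the .pte processor
};

inline bool has_suffix(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
      s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide how an audio file should be handled, mirroring the documented rules.
inline AudioRoute choose_audio_route(
    const std::string& audio_path,
    const std::string& processor_path) {
  const bool is_wav = has_suffix(audio_path, ".wav");
  const bool is_bin = has_suffix(audio_path, ".bin");
  if (!is_wav && !is_bin) {
    throw std::invalid_argument(
        "Unsupported audio file format (only .bin and .wav are supported)");
  }
  if (!processor_path.empty()) {
    return AudioRoute::kRawViaProcessor;
  }
  if (is_wav) {
    throw std::invalid_argument("Processor path is required for .wav files");
  }
  assert(is_bin);  // the only remaining case
  return AudioRoute::kPreprocessed;
}
```

Separating the decision from the I/O keeps the three-way rule easy to unit test. Note that the suffix comparison here is case-sensitive, so a path ending in `.WAV` would be rejected; whether the runner's `ends_with` behaves the same way is not shown in this diff.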
extension/llm/runner/targets.bzl

Lines changed: 1 addition & 0 deletions
@@ -105,6 +105,7 @@ def define_common_targets():
         exported_headers = [
             "audio.h",
             "image.h",
+            "wav_loader.h",
             "multimodal_input.h",
             "multimodal_runner.h",
             "multimodal_prefiller.h",

extension/llm/runner/test/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ include(${EXECUTORCH_ROOT}/tools/cmake/Test.cmake)

 set(_test_srcs
     test_generation_config.cpp test_text_llm_runner.cpp test_text_prefiller.cpp
-    test_text_decoder_runner.cpp test_multimodal_input.cpp
+    test_text_decoder_runner.cpp test_multimodal_input.cpp test_wav_loader.cpp
 )

 # Add LSan stub for Apple platforms

extension/llm/runner/test/targets.bzl

Lines changed: 10 additions & 0 deletions
@@ -44,3 +44,13 @@ def define_common_targets():
             "//executorch/extension/llm/runner:multimodal_runner_lib",
         ],
     )
+
+    runtime.cxx_test(
+        name = "test_wav_loader",
+        srcs = ["test_wav_loader.cpp"],
+        deps = [
+            "//executorch/extension/testing_util:temp_file",
+            "//executorch/extension/llm/runner:multimodal_runner_lib",
+            "//executorch/runtime/platform:platform",
+        ],
+    )
