
Commit b19c227

[New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support (#750)
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent c9b31a4 · commit b19c227
23 files changed: +8041 −0 lines changed

docs/models/supported_models.md (1 addition & 0 deletions)

```diff
@@ -43,6 +43,7 @@ th {
 |`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` |
 |`GlmImageForConditionalGeneration` | GLM-Image | `zai-org/GLM-Image` |
 |`NextStep11Pipeline` | NextStep-1.1 | `stepfun-ai/NextStep-1.1` |
+|`MiMoAudioForConditionalGeneration` | MiMo-Audio-7B-Instruct | `XiaomiMiMo/MiMo-Audio-7B-Instruct` |

 ## List of Supported Models for NPU
```
Lines changed: 232 additions & 0 deletions (new file)
# MiMo-Audio Offline Inference

This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally.

## Model Overview

MiMo-Audio provides multiple task variants for audio understanding and generation:

- **tts_sft**: Basic text-to-speech generation from text input.
- **tts_sft_with_instruct**: TTS generation with explicit voice-style instructions.
- **tts_sft_with_audio**: TTS generation with an audio reference for voice cloning.
- **tts_sft_with_natural_instruction**: TTS generation from natural-language voice descriptions embedded in the text.
- **audio_trancribing_sft**: Transcribe audio to text (speech-to-text).
- **audio_understanding_sft**: Understand and analyze audio content with text queries.
- **audio_understanding_sft_with_thinking**: Audio understanding with a reasoning chain.
- **spoken_dialogue_sft_multiturn**: Multi-turn spoken dialogue with audio input and output.
- **speech2text_dialogue_sft_multiturn**: Multi-turn dialogue converting speech to text.
- **text_dialogue_sft_multiturn**: Multi-turn text-only dialogue.
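All of the task variants above share the same command-line shape, differing only in `--query-type` and a few task-specific flags. For scripting batches of runs, a small helper (hypothetical, not part of `end2end.py`; the flag names mirror the examples in this README) can assemble the command for any task:

```python
# Hypothetical driver helper for end2end.py runs.
STAGE_CONFIG = "vllm_omni/model_executor/stage_configs/mimo_audio.yaml"
MODEL = "XiaomiMiMo/MiMo-Audio-7B-Instruct"

def build_cmd(query_type, **extra_flags):
    """Assemble the end2end.py command line for one task variant.

    extra_flags maps flag names (underscores for dashes) to values,
    e.g. text="hello" becomes --text hello.
    """
    cmd = [
        "python3", "-u", "end2end.py",
        "--stage-configs-path", STAGE_CONFIG,
        "--model-name", MODEL,
        "--query-type", query_type,
    ]
    for flag, value in extra_flags.items():
        cmd += [f"--{flag.replace('_', '-')}", str(value)]
    return cmd
```

The result can be passed straight to `subprocess.run(build_cmd("tts_sft", text="Hello"), check=True)`.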
## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

### Environment Variables

The `MIMO_AUDIO_TOKENIZER_PATH` environment variable is mandatory due to the model's specialized architecture:

```bash
export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
```
## Quick Start

Run a single sample for basic TTS:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft
```

Run batch samples for basic TTS:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft \
    --num-prompts {batch_size}
```

When multi-batch processing is enabled and the total number of tokens passed to the next stage exceeds the `max_model_len` value in the `mimo_audio.yaml` configuration file, you must also update `max_position_embeddings` in `MiMo-Audio-7B-Instruct/config.json` to match the raised value.
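Assuming the model checkpoint is downloaded locally and its `config.json` is writable, keeping the two values in sync can be scripted; a minimal sketch (the path and target length below are illustrative placeholders):

```python
import json

def sync_max_position_embeddings(config_path, max_model_len):
    """Rewrite max_position_embeddings in an HF-style config.json so it
    matches the max_model_len set in mimo_audio.yaml."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["max_position_embeddings"] = max_model_len
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)

# e.g. sync_max_position_embeddings("MiMo-Audio-7B-Instruct/config.json", 16384)
```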
Generated audio files are saved to `output_audio/` by default. `--num-prompts` can also be used with all of the tasks below.
## Task Usage

### tts_sft (Basic Text-to-Speech)

Generate speech from text input:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft \
    --text "The weather is so nice today."
```

### tts_sft_with_instruct (TTS with Voice Instructions)

Generate speech with explicit voice style instructions:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft_with_instruct \
    --text "The weather is so nice today." \
    --instruct "Speak happily in a child's voice"
```

### tts_sft_with_audio (TTS with Audio Reference)

Generate speech using an audio reference for voice cloning:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft_with_audio \
    --text "The weather is so nice today." \
    --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```
### tts_sft_with_natural_instruction (Natural Language TTS)

Generate speech from text containing natural voice descriptions:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type tts_sft_with_natural_instruction \
    --text "In a panting young male voice, he said: I can't run anymore, wait for me!"
```

### audio_trancribing_sft (Speech-to-Text)

Transcribe audio to text:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type audio_trancribing_sft \
    --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```
### audio_understanding_sft (Audio Understanding)

Understand and analyze audio content with text queries:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type audio_understanding_sft \
    --text "Summarize the audio." \
    --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

### audio_understanding_sft_with_thinking (Audio Understanding with Reasoning)

Audio understanding with a reasoning chain:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type audio_understanding_sft_with_thinking \
    --text "Summarize the audio." \
    --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```
### spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue)

Multi-turn dialogue with audio input and output:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type spoken_dialogue_sft_multiturn \
    --audio-path "./prompt_speech_zh_m.wav"
```

Note: This task uses hardcoded audio files in the script. The audio files used in the examples are available at https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples.
### speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue)

Multi-turn dialogue converting speech to text:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type speech2text_dialogue_sft_multiturn
```

Note: This task uses hardcoded audio files and message lists in the script.
### text_dialogue_sft_multiturn (Text Dialogue)

Multi-turn text-only dialogue:

```bash
python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type text_dialogue_sft_multiturn
```

Note: This task uses hardcoded message lists in the script.
## Troubleshooting

### Audio dependencies (soundfile, librosa)

This example depends on **soundfile** (reading/writing WAV) and **librosa** (loading audio, including MP3). Install the project requirements first:

```bash
pip install -r requirements/common.txt
# or at least: pip install "soundfile>=0.13.1" "librosa>=0.11.0"
```

- **`soundfile` / libsndfile not found**
  `soundfile` uses the C library **libsndfile**. On Linux, install the system package before pip:
  - Debian/Ubuntu: `sudo apt-get install libsndfile1`
  - For development builds: `sudo apt-get install libsndfile1-dev`
  - Then: `pip install soundfile`

- **`librosa` fails to load MP3 or reports "No backend available"**
  Loading MP3 (e.g. in `spoken_dialogue_sft_multiturn` with `.mp3` files) uses **ffmpeg** as the backend. Install ffmpeg:
  - Debian/Ubuntu: `sudo apt-get install ffmpeg`
  - macOS: `brew install ffmpeg`

- **`ImportError: No module named 'soundfile'` or `ModuleNotFoundError: ... librosa`**
  Ensure you are in the same Python environment where vLLM Omni and the example dependencies are installed, and that `requirements/common.txt` (or the packages above) is installed.
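To confirm that the active Python environment can actually import the audio packages, a quick stdlib-only check (a hypothetical helper, not part of `end2end.py`) might look like:

```python
import importlib.util

def missing_audio_deps(packages=("soundfile", "librosa")):
    """Return the names of audio packages that cannot be imported
    from the current Python environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# An empty list means all required audio packages are importable.
print(missing_audio_deps())
```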
### Tokenizer path

- **`MIMO_AUDIO_TOKENIZER_PATH` not set, or the model fails to find the tokenizer**
  Export the tokenizer path before running:

  ```bash
  export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
  ```

  See [Environment Variables](#environment-variables) in Setup.
### Other

- If the model or stage config fails to load, check the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) for memory and GPU settings.
- For errors when reading or writing WAV files (e.g. unsupported format), ensure input files are standard WAV/MP3 and that `soundfile` is linked against a working libsndfile (see above).
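When a generated WAV looks suspect, its header can be inspected with the standard library alone; this illustrative helper (an assumption, not part of the example script) avoids pulling in soundfile:

```python
import wave

def wav_summary(path):
    """Read basic properties from a WAV header using only the stdlib."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

A `wave.Error` raised here usually means the file is not a standard PCM WAV.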
## Notes

- The script uses default model paths and audio files embedded in `end2end.py`. Update them if your local cache path differs.
- Use `--output-dir` to change the output folder (default: `./output_audio`).
- Use `--num-prompts` to generate multiple prompts in one run (default: 1).
- Audio files used in the multi-turn dialogue examples are available at https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples.
- The script supports configuration options for initialization timeouts, batch timeouts, and shared memory thresholds; see `--help` for details.
