[New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support #750
Changes from 185 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| # MiMo-Audio Offline Inference | ||
|
|
||
| This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally. | ||
|
|
||
| ## Model Overview | ||
|
|
||
| MiMo-Audio provides multiple task variants for audio understanding and generation: | ||
|
|
||
| - **tts_sft**: Basic text-to-speech generation from text input. | ||
| - **tts_sft_with_instruct**: TTS generation with explicit voice style instructions. | ||
| - **tts_sft_with_audio**: TTS generation with audio reference for voice cloning. | ||
| - **tts_sft_with_natural_instruction**: TTS generation from natural language descriptions embedded in text. | ||
| - **audio_trancribing_sft**: Transcribe audio to text (speech-to-text). | ||
| - **audio_understanding_sft**: Understand and analyze audio content with text queries. | ||
| - **audio_understanding_sft_with_thinking**: Audio understanding with a reasoning chain. | ||
| - **spoken_dialogue_sft_multiturn**: Multi-turn spoken dialogue with audio input/output. | ||
| - **speech2text_dialogue_sft_multiturn**: Multi-turn dialogue converting speech to text. | ||
| - **text_dialogue_sft_multiturn**: Multi-turn text-only dialogue. | ||
|
|
||
| ## Setup | ||
|
|
||
| See the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation for your hardware. | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| Set the `MIMO_AUDIO_TOKENIZER_PATH` environment variable before running the demo. It is mandatory because of the model's specialized architecture: | ||
|
|
||
| ```bash | ||
| export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer" | ||
| ``` | ||
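Because the variable is mandatory, it can help to check for it before launching the demo. The following is a minimal sketch and not part of `end2end.py`; the helper name `require_tokenizer_path` is hypothetical.

```python
import os


def require_tokenizer_path(env=None):
    """Return MIMO_AUDIO_TOKENIZER_PATH, or raise if it is missing or empty.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This helper is illustrative only and is not defined by the demo script.
    """
    env = os.environ if env is None else env
    path = env.get("MIMO_AUDIO_TOKENIZER_PATH", "").strip()
    if not path:
        raise RuntimeError(
            "MIMO_AUDIO_TOKENIZER_PATH is not set; export it before running end2end.py"
        )
    return path
```

Calling `require_tokenizer_path()` at the top of a wrapper script fails fast with a readable message instead of a later error during model initialization.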
|
|
||
| ## Quick Start | ||
|
|
||
| Run a single sample for basic TTS: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft | ||
| ``` | ||
|
|
||
| Run batch samples for basic TTS: | ||
|
|
||
| ```bash | ||
|
Collaborator
Consider adding a troubleshooting section for common installation issues, especially around audio dependencies.
Contributor
Done
qibaoyuan marked this conversation as resolved.
|
||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft \ | ||
| --num-prompts {batch_size} | ||
| ``` | ||
|
|
||
| When running multiple prompts in a batch, the total number of tokens passed to the next stage may exceed the `max_model_len` value in `mimo_audio.yaml`. If you raise `max_model_len` to accommodate this, you must also update `max_position_embeddings` in `MiMo-Audio-7B-Instruct/config.json` to the same value. | ||
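Keeping the two values in sync by hand is easy to forget. The sketch below updates `max_position_embeddings` in a local `config.json` to a value you supply. It assumes you pass in the same number you set for `max_model_len`; it does not parse `mimo_audio.yaml`, and the function name `sync_max_position_embeddings` is hypothetical.

```python
import json
from pathlib import Path


def sync_max_position_embeddings(config_path, max_model_len):
    """Set max_position_embeddings in a model config.json to max_model_len.

    Returns the previous value (None if the key was absent). The file is
    rewritten only when the value actually changes.
    """
    if not isinstance(max_model_len, int) or max_model_len <= 0:
        raise ValueError("max_model_len must be a positive integer")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    if previous != max_model_len:
        config["max_position_embeddings"] = max_model_len
        path.write_text(
            json.dumps(config, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
    return previous
```

For example, `sync_max_position_embeddings("MiMo-Audio-7B-Instruct/config.json", new_len)`, where `new_len` is whatever you set `max_model_len` to. Back up `config.json` first if it lives in a shared cache.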
|
|
||
| Generated audio files are saved to `output_audio/` by default. `--num-prompts` can also be used with all tasks below. | ||
|
|
||
| ## Task Usage | ||
|
|
||
| ### tts_sft (Basic Text-to-Speech) | ||
|
|
||
| Generate speech from text input: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft \ | ||
| --text "The weather is so nice today." | ||
| ``` | ||
|
|
||
| ### tts_sft_with_instruct (TTS with Voice Instructions) | ||
|
|
||
| Generate speech with explicit voice style instructions: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
|
Collaborator
The port 8080 is used here, but the online serving example uses 8000. This inconsistency could confuse users. Consider standardizing on one port.
qibaoyuan marked this conversation as resolved.
|
||
| --query-type tts_sft_with_instruct \ | ||
| --text "The weather is so nice today." \ | ||
| --instruct "Speak happily in a child's voice" | ||
| ``` | ||
|
|
||
| ### tts_sft_with_audio (TTS with Audio Reference) | ||
|
|
||
| Generate speech using an audio reference for voice cloning: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft_with_audio \ | ||
| --text "The weather is so nice today." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### tts_sft_with_natural_instruction (Natural Language TTS) | ||
|
|
||
| Generate speech from text containing natural voice descriptions: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft_with_natural_instruction \ | ||
| --text "In a panting young male voice, he said: I can't run anymore, wait for me!" | ||
| ``` | ||
|
|
||
| ### audio_trancribing_sft (Speech-to-Text) | ||
|
|
||
| Transcribe audio to text: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_trancribing_sft \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### audio_understanding_sft (Audio Understanding) | ||
|
|
||
| Understand and analyze audio content with text queries: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_understanding_sft \ | ||
| --text "Summarize the audio." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### audio_understanding_sft_with_thinking (Audio Understanding with Reasoning) | ||
|
|
||
| Understand audio content with a reasoning chain: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_understanding_sft_with_thinking \ | ||
| --text "Summarize the audio." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue) | ||
|
|
||
| Multi-turn dialogue with audio input and output: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type spoken_dialogue_sft_multiturn \ | ||
| --audio-path "./prompt_speech_zh_m.wav" | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples | ||
|
|
||
| ### speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue) | ||
|
|
||
| Multi-turn dialogue converting speech to text: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type speech2text_dialogue_sft_multiturn | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded audio files and message lists in the script. | ||
|
|
||
| ### text_dialogue_sft_multiturn (Text Dialogue) | ||
|
|
||
| Multi-turn text-only dialogue: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type text_dialogue_sft_multiturn | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded message lists in the script. | ||
|
|
||
| ## Notes | ||
|
|
||
| - The script uses default model paths and audio files embedded in `end2end.py`. Update them if your local cache path differs. | ||
| - Use `--output-dir` to change the output folder (default: `./output_audio`). | ||
| - Use `--num-prompts` to generate multiple prompts in one run (default: 1). | ||
| - Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples | ||
| - The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See `--help` for details. | ||
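The per-task commands above differ only in a few flags, so a small wrapper can assemble them. This is a sketch under the assumption that `end2end.py` accepts exactly the flags shown in this README (`--text`, `--instruct`, `--audio-path`, `--num-prompts`, `--output-dir`); the helpers `build_command` and `run_task` are hypothetical and not part of the demo.

```python
import subprocess

# Paths and model name copied from the commands in this README.
STAGE_CONFIG = "vllm_omni/model_executor/stage_configs/mimo_audio.yaml"
MODEL_NAME = "XiaomiMiMo/MiMo-Audio-7B-Instruct"


def build_command(query_type, text=None, instruct=None, audio_path=None,
                  num_prompts=None, output_dir=None):
    """Build the argv list for one end2end.py run; optional flags are added only if set."""
    cmd = [
        "python3", "-u", "end2end.py",
        "--stage-configs-path", STAGE_CONFIG,
        "--model-name", MODEL_NAME,
        "--query-type", query_type,
    ]
    optional = [
        ("--text", text),
        ("--instruct", instruct),
        ("--audio-path", audio_path),
        ("--num-prompts", num_prompts),
        ("--output-dir", output_dir),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_task(query_type, **kwargs):
    """Run one task and raise CalledProcessError if the script exits non-zero."""
    subprocess.run(build_command(query_type, **kwargs), check=True)
```

For instance, `run_task("tts_sft", text="The weather is so nice today.", num_prompts=4)` mirrors the batch example in Quick Start. Passing the arguments as a list avoids shell quoting problems with texts that contain apostrophes or spaces.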