Commit a54c323

Authored by Gaohan123
[Model] Support Qwen3-TTS model series (vllm-project#895)
Signed-off-by: Gaohan123 <hgaoaf@connect.ust.hk>
1 parent 0df8e80

35 files changed: +9,642 −42 lines

docs/.nav.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -14,6 +14,7 @@ nav:
       - Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
       - Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
       - Qwen3-Omni: user_guide/examples/offline_inference/qwen3_omni.md
+      - Qwen3-TTS Offline Inference: user_guide/examples/offline_inference/qwen3_tts.md
       - Text-To-Image: user_guide/examples/offline_inference/text_to_image.md
       - Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
   - Online Serving:
```

docs/api/README.md

Lines changed: 16 additions & 33 deletions
```diff
@@ -36,7 +36,6 @@ Input data structures for multi-modal inputs.
 
 Engine classes for offline and online inference.
 
-- [vllm_omni.diffusion.diffusion_engine.BackgroundResources][]
 - [vllm_omni.diffusion.diffusion_engine.DiffusionEngine][]
 - [vllm_omni.engine.AdditionalInformationEntry][]
 - [vllm_omni.engine.AdditionalInformationPayload][]
@@ -57,38 +56,11 @@ Core scheduling and caching components.
 - [vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler][]
 - [vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler][]
 - [vllm_omni.core.sched.output.OmniNewRequestData][]
-
-## Model Executor
-
-Model execution components.
-
-- [vllm_omni.model_executor.custom_process_mixin.CustomProcessMixin][]
-- [vllm_omni.model_executor.models.output_templates.OmniOutput][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni.Qwen2_5OmniForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_talker.Qwen2_5OmniTalkerForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_thinker.Qwen2_5OmniConditionalGenerationMixin][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_thinker.Qwen2_5OmniThinkerForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavBigVGANModel][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavDiTModel][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavForConditionalGenerationVLLM][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavModel][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_old.Qwen2EmbeddingModel][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_old.Qwen2ForCausalLM][]
-- [vllm_omni.model_executor.models.qwen2_5_omni.qwen2_old.Qwen2Model][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_moe.Qwen3MoeForCausalLM][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni.Qwen3OmniMoeForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_code2wav.Qwen3OmniMoeCode2Wav][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_code_predictor_mtp.Qwen3OmniCodePredictorBaseModel][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_code_predictor_mtp.Qwen3OmniMoeTalkerCodePredictor][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeModel][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerSharedExpertWrapper][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMForCausalLM][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMModel][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeConditionalGenerationMixin][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeThinkerForConditionalGeneration][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeThinkerMultiModalProcessor][]
-- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeThinkerProcessingInfo][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedGroupResidualVectorQuantization][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedResidualVectorQuantization][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.EuclideanCodebook][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.VectorQuantization][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.preprocess][]
 
 ## Configuration
 
```
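The `core_vq` entries added above are the classic residual-vector-quantization (RVQ) building blocks: a `EuclideanCodebook` snaps vectors to their nearest code, `VectorQuantization` wraps one codebook, and the residual variants stack several stages so that each stage quantizes what the previous one left over. A generic PyTorch sketch of that idea (illustration only; names, shapes, and signatures here are not vllm_omni's actual API):

```python
# Generic illustration of residual vector quantization (RVQ), the technique
# behind the core_vq classes listed above. Not vllm_omni's actual API.
import torch


def nearest_code(x: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Return the index of the closest codebook entry for each row of x."""
    # x: (n, d), codebook: (k, d) -> indices: (n,)
    return torch.cdist(x, codebook).argmin(dim=-1)


def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left over by the previous stage."""
    residual, indices = x, []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        indices.append(idx)
        residual = residual - cb[idx]  # what this stage failed to capture
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected code vectors per stage."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))


torch.manual_seed(0)
books = [torch.randn(256, 8) for _ in range(4)]  # 4 stages, 256 codes each
frames = torch.randn(16, 8)
codes = rvq_encode(frames, books)
recon = rvq_decode(codes, books)
print((frames - recon).norm() / frames.norm())  # error shrinks as stages are added
```

Each frame is thus described by a handful of small integer indices rather than a raw vector, which is presumably why the 25 Hz speech tokenizer here is built on an RVQ stack.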
```diff
@@ -98,6 +70,17 @@ Configuration classes.
 - [vllm_omni.diffusion.cache.teacache.config.TeaCacheConfig][]
 - [vllm_omni.distributed.omni_connectors.utils.config.ConnectorSpec][]
 - [vllm_omni.distributed.omni_connectors.utils.config.OmniTransferConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSSpeakerEncoderConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSTalkerCodePredictorConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSTalkerConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_12hz.configuration_qwen3_tts_tokenizer_v2.Qwen3TTSTokenizerV2Config][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_12hz.configuration_qwen3_tts_tokenizer_v2.Qwen3TTSTokenizerV2DecoderConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1Config][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderBigVGANConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderDiTConfig][]
+- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1EncoderConfig][]
 
 ## Workers
 
```
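The new configuration modules follow the Hugging Face `transformers` naming convention (`configuration_qwen3_tts.py`), which suggests `PretrainedConfig` subclasses. A hedged sketch of how such a config would typically be loaded — the `from_pretrained` call and the nested-attribute name are assumptions based on that convention, not verified against this commit:

```python
# Hedged sketch: assumes Qwen3TTSConfig subclasses transformers.PretrainedConfig,
# as the configuration_qwen3_tts module naming suggests. The nested attribute
# name (talker_config) is a guess based on Qwen3TTSTalkerConfig existing.
from vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts import (
    Qwen3TTSConfig,
)

config = Qwen3TTSConfig.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
print(config)                       # dumps the full nested configuration
print(type(config.talker_config))   # e.g. Qwen3TTSTalkerConfig (assumed)
```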
docs/mkdocs/hooks/generate_api_readme.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -34,10 +34,10 @@
         "name": "Core",
         "description": "Core scheduling and caching components.",
     },
-    "model_executor": {
-        "name": "Model Executor",
-        "description": "Model execution components.",
-    },
+    # "model_executor": {
+    #     "name": "Model Executor",
+    #     "description": "Model execution components.",
+    # },
     "config": {
         "name": "Configuration",
         "description": "Configuration classes.",
```

docs/models/supported_models.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -34,6 +34,9 @@ th {
 |`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` |
 |`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` |
 |`StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` |
+|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` |
+|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` |
+|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` |
 
 
 ## List of Supported Models for NPU
```
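The three checkpoints map onto the task variants of the offline example documented below; presumably each `--query-type` selects the matching checkpoint. A quick orientation, using the script and flags the example doc describes:

```bash
# Hypothetical pairing of checkpoints to --query-type values, inferred from
# the variant names; end2end.py and its flags are documented below.
python end2end.py --query-type CustomVoice   # Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
python end2end.py --query-type VoiceDesign   # Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
python end2end.py --query-type Base          # Qwen/Qwen3-TTS-12Hz-0.6B-Base
```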
docs/user_guide/examples/offline_inference/qwen3_tts.md

Lines changed: 94 additions & 0 deletions (new file)
# Qwen3-TTS Offline Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_tts>.

This directory contains an offline demo for running Qwen3-TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.

## Model Overview

Qwen3-TTS provides multiple task variants for speech generation:

- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and an optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Generate speech by cloning a voice from reference audio and its transcript, with optional mode selection.

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

## Quick Start

Run a single sample for a task:

```bash
python end2end.py --query-type CustomVoice
```

Generated audio files are saved to `output_audio/` by default.

## Task Usage

### CustomVoice

Single sample:

```bash
python end2end.py --query-type CustomVoice
```

Batch sample (multiple prompts in one run):

```bash
python end2end.py --query-type CustomVoice --use-batch-sample
```

### VoiceDesign

Single sample:

```bash
python end2end.py --query-type VoiceDesign
```

Batch sample:

```bash
python end2end.py --query-type VoiceDesign --use-batch-sample
```

### Base (Voice Clone)

Single sample:

```bash
python end2end.py --query-type Base
```

Batch sample:

```bash
python end2end.py --query-type Base --use-batch-sample
```

Mode selection for Base:

- `--mode-tag icl` (default): standard mode
- `--mode-tag xvec_only`: enable `x_vector_only_mode` in the request

Example:

```bash
python end2end.py --query-type Base --mode-tag icl
```

## Notes

- The script uses the model paths embedded in `end2end.py`; update them if your local cache path differs.
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.

## Example materials

??? abstract "end2end.py"

    ``````py
    --8<-- "examples/offline_inference/qwen3_tts/end2end.py"
    ``````
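The flags documented above compose. For instance, Base-mode cloning with `x_vector_only_mode` and a custom output directory (each flag is documented above; combining them in one invocation is an assumption):

```bash
python end2end.py --query-type Base --mode-tag xvec_only --output-dir my_audio/
```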

docs/user_guide/examples/online_serving/text_to_image.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -119,6 +119,7 @@ Use `extra_body` to pass generation parameters:
 | `seed` | int | None | Random seed (reproducible) |
 | `negative_prompt` | str | None | Negative prompt |
 | `num_outputs_per_prompt` | int | 1 | Number of images to generate |
+| `--cfg-parallel-size` | int | 1 | Number of GPUs for CFG parallelism |
 
 ## Response Format
 
```
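Since these parameters travel in `extra_body`, a client call could look like the following hedged sketch (assumes the OpenAI-compatible images endpoint this page documents; the base URL, API key, and model name are placeholders for your deployment):

```python
# Hedged sketch: parameter names come from the table above; endpoint details
# are placeholders, not verified against this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
result = client.images.generate(
    model="stabilityai/stable-diffusion-3.5-medium",
    prompt="a watercolor fox at dawn",
    extra_body={
        "seed": 42,                   # reproducible sampling
        "negative_prompt": "blurry",  # steer away from artifacts
        "num_outputs_per_prompt": 1,  # one image back
    },
)
```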
Lines changed: 84 additions & 0 deletions (new file)
# Qwen3-TTS Offline Inference

This directory contains an offline demo for running Qwen3-TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.

## Model Overview

Qwen3-TTS provides multiple task variants for speech generation:

- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and an optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Generate speech by cloning a voice from reference audio and its transcript, with optional mode selection.

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

## Quick Start

Run a single sample for a task:

```bash
python end2end.py --query-type CustomVoice
```

Generated audio files are saved to `output_audio/` by default.

## Task Usage

### CustomVoice

Single sample:

```bash
python end2end.py --query-type CustomVoice
```

Batch sample (multiple prompts in one run):

```bash
python end2end.py --query-type CustomVoice --use-batch-sample
```

### VoiceDesign

Single sample:

```bash
python end2end.py --query-type VoiceDesign
```

Batch sample:

```bash
python end2end.py --query-type VoiceDesign --use-batch-sample
```

### Base (Voice Clone)

Single sample:

```bash
python end2end.py --query-type Base
```

Batch sample:

```bash
python end2end.py --query-type Base --use-batch-sample
```

Mode selection for Base:

- `--mode-tag icl` (default): standard mode
- `--mode-tag xvec_only`: enable `x_vector_only_mode` in the request

Example:

```bash
python end2end.py --query-type Base --mode-tag icl
```

## Notes

- The script uses the model paths embedded in `end2end.py`; update them if your local cache path differs.
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
