[Feat][Qwen3TTS] reduce TTFA with flexible warmup phase #1583
JuanPZuluaga wants to merge 9 commits into vllm-project:main
Conversation
Signed-off-by: pablo <pablo@agigo.ai>
Can we reduce TTFP by adjusting chunk_size?
Indeed, but then we would always be decoding with very small chunks. With this approach, at least, we know that the TTFA can go below 500 ms without compromising the quality too much and without too much overhead. We could even increase the chunk size to higher values to get better audio quality without increasing TTFA.
Signed-off-by: pablo <pablo@agigo.ai>
Please let me know what you think @amy-why-3459 @linyueqian. TTFC can go below ~300 ms. Also, should I add some tests?
Thank you so much for your contribution, it's a great idea. May I ask what your test scenario is? For example, concurrency and input/output length?
Do you mean a real use case? If so, ideally we would like to reduce TTFC under high-concurrency loads for voice assistants, where we need very low latency when generating the first audio for the user. In batched offline decoding scenarios, on the other hand, I wouldn't see much importance in this value.
It makes sense to me. Additionally, I have a question: is it better to implement `initial_codec_chunk_frames` as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Welcome to discuss.
This is a good idea actually, though it means adding yet more parameters to be set at the request level. Let me know what you think, and we can implement it.
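To make the stage-level vs request-level discussion concrete, here is a minimal sketch of the precedence being proposed (request-level value wins when explicitly set). The class and function names below are illustrative assumptions, not the actual vllm-omni config API:

```python
# Hypothetical sketch: request-level initial_codec_chunk_frames overrides
# the stage-level default when explicitly provided.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageConfig:
    # Stage-level default; 0 = warmup phase disabled.
    initial_codec_chunk_frames: int = 0


@dataclass
class RequestParams:
    # None = not set on the request, fall back to the stage config.
    initial_codec_chunk_frames: Optional[int] = None


def resolve_initial_chunk_frames(stage: StageConfig, req: RequestParams) -> int:
    # Request-level value takes priority when explicitly provided.
    if req.initial_codec_chunk_frames is not None:
        return req.initial_codec_chunk_frames
    return stage.initial_codec_chunk_frames


print(resolve_initial_chunk_frames(StageConfig(5), RequestParams()))   # → 5 (stage default)
print(resolve_initial_chunk_frames(StageConfig(5), RequestParams(2)))  # → 2 (request override)
```

One design note: using `None` (rather than 0) as the request-level "unset" sentinel keeps "explicitly disable warmup for this request" (`0`) distinguishable from "use the stage default".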
I'd like to ask about the test scenarios in your Test Result. Also, could you please adapt this solution to qwen3-omni and mimo_audio as well?
What are the definitions of TTFA/TTFP?
Sorry about that, I've mixed both, but they mean the same thing: Time To First Audio or Time To First Packet. So far, I hear glitches while doing "live streaming" because the model cannot keep up with real-time processing (meaning RTFx is less than 1). I expect it to be fast once we get the different models compiled, etc.
Purpose
Related to #938.
In this PR, we lower the TTFA of Qwen3TTS by reducing the number of frames required before the output starts streaming. Ideally, this helps the whole system feel more responsive by reducing the TTFA.
The current implementation is not very flexible: if one wants to reduce the TTFA, we end up always decoding with very small chunks, which can reduce audio quality.
We introduce `initial_chunk_size`, which dictates the chunk size used to generate the first audio; we can call this phase the "warmup phase". During warmup, decoding uses `initial_chunk_size` + full left context; afterwards, it switches back to the regular `chunk_size` + `left_context_size`.
PS: I decided not to add support for `left_context_size=-1` (aka full left context), because we might get OOM on very long sequences (too large a context); the user can instead just increase the left context to, say, 10 s. I will add some audio samples later.
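A minimal sketch of the two-phase chunk schedule described above (function and parameter names are illustrative, not the actual vllm-omni implementation): the first chunk is small to minimize TTFA and uses the full left context, and subsequent chunks use the regular size with a bounded left context.

```python
# Hypothetical sketch of the "warmup phase" chunk schedule; names like
# initial_chunk_size / chunk_size / left_context_size mirror the PR
# description but the function itself is illustrative.

def chunk_schedule(total_frames: int, initial_chunk_size: int,
                   chunk_size: int, left_context_size: int):
    """Yield (start, end, ctx_start) frame ranges for codec decoding."""
    pos = 0
    first = True
    while pos < total_frames:
        # Warmup: one small chunk first (if enabled) to cut TTFA.
        size = initial_chunk_size if (first and initial_chunk_size > 0) else chunk_size
        end = min(pos + size, total_frames)
        # Warmup chunk uses the full left context; later chunks bound it.
        ctx_start = 0 if first else max(0, pos - left_context_size)
        yield pos, end, ctx_start
        pos = end
        first = False


# Example: 10 frames, warmup chunk of 2, regular chunks of 4, context of 4
print(list(chunk_schedule(10, 2, 4, 4)))
# → [(0, 2, 0), (2, 6, 0), (6, 10, 2)]
```

With `initial_chunk_size = 0` the schedule degenerates to the regular fixed-size chunking, which matches the "0 means disabled" convention used in the tests below.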
Test Plan
Test Result
Tested with `initial_codec_chunk_frames = 0, 2, 5, 10, 15, 20`; 0 means no `initial_codec_chunk_frames` (warmup disabled). `ic` in the sample names is the value set for `initial_codec_chunk_frames`. Metrics reported:
TTFA (Time To First Audio) in milliseconds
Total Generation Time (ms)
Inter-Chunk Time (ms)
output_1_ic_0.wav
output_1_ic_2.wav
output_1_ic_5.wav
output_1_ic_10.wav
output_1_ic_15.wav
output_1_ic_20.wav
Note that `ic_15` and `ic_20` would yield similar results.
EDIT: removed the first sample (in the tables) due to overhead in compile.
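For reproducibility, the three metrics above can be measured over any streaming chunk iterator roughly like this; `measure_streaming_latency` and the iterator argument are illustrative stand-ins, not the real vllm-omni streaming API:

```python
# Hypothetical sketch of how TTFA, total generation time, and mean
# inter-chunk time can be measured over a streaming audio generator.
import time


def measure_streaming_latency(chunk_iter):
    """Return (ttfa_ms, total_ms, inter_chunk_ms) for a chunk iterator."""
    start = time.perf_counter()
    ttfa_ms = None
    gaps = []
    prev = start
    for _chunk in chunk_iter:
        now = time.perf_counter()
        if ttfa_ms is None:
            ttfa_ms = (now - start) * 1000.0  # Time To First Audio
        else:
            gaps.append((now - prev) * 1000.0)  # gap since previous chunk
        prev = now
    total_ms = (time.perf_counter() - start) * 1000.0
    inter_chunk_ms = sum(gaps) / len(gaps) if gaps else 0.0
    return ttfa_ms, total_ms, inter_chunk_ms
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and high-resolution, which matters for sub-millisecond inter-chunk gaps.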