[Feat][Qwen3TTS] reduce TTFA with flexible warmup phase#1583

Open
JuanPZuluaga wants to merge 9 commits intovllm-project:mainfrom
JuanPZuluaga:feat/qwen3tts-config-ttfp

Conversation


@JuanPZuluaga JuanPZuluaga commented Mar 1, 2026


Purpose

Related to #938.

In this PR, we lower the TTFA (Time To First Audio) of Qwen3TTS by requiring fewer frames before output streaming starts. This helps the whole system feel more responsive.

The current implementation is not very flexible: to reduce TTFA, one has to decode with very small chunks throughout, which can degrade audio quality.

  • We add initial_chunk_size, which dictates the chunk rate used to generate audio during a "warmup phase": initial_chunk_size + full left context.
  • Then, once enough frames have been collected, we move to the standard decoding phase: chunk_size + left_context_size.

PS: I decided not to add support for left_context_size=-1 (i.e., full left context), because very long sequences could cause OOM (the context grows too large); instead, the user can simply increase the left context to, say, 10 s.
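The two-phase schedule above can be sketched as follows. This is an illustrative Python sketch, not the actual vllm-omni implementation: the parameter names (`initial_codec_chunk_frames`, `chunk_size`, `left_context_size`) mirror the PR's options, while `plan_chunks` itself is a hypothetical helper.

```python
def plan_chunks(total_frames: int,
                chunk_size: int = 25,
                left_context_size: int = 25,
                initial_codec_chunk_frames: int = 0):
    """Yield (start, end, context_start) windows for code2wav decoding.

    Warmup phase: decode as soon as `initial_codec_chunk_frames` frames are
    available, with full left context (everything generated so far).
    Standard phase: once a full `chunk_size` worth of frames exists, fall
    back to fixed-size chunks with a bounded left context.
    """
    pos = 0
    # Warmup phase: small chunks until the first full chunk is covered.
    if initial_codec_chunk_frames > 0:
        while pos + initial_codec_chunk_frames <= min(chunk_size, total_frames):
            end = pos + initial_codec_chunk_frames
            yield (pos, end, 0)  # context_start=0 -> full left context
            pos = end
    # Standard phase: fixed chunk_size with bounded left context.
    while pos < total_frames:
        end = min(pos + chunk_size, total_frames)
        yield (pos, end, max(0, pos - left_context_size))
        pos = end

# Example: 60 frames, warmup chunks of 5 frames. The first audio chunk now
# ends at frame 5 instead of frame 25, so streaming can start much sooner.
schedule = list(plan_chunks(60, chunk_size=25, left_context_size=25,
                            initial_codec_chunk_frames=5))
```

With `initial_codec_chunk_frames=0` the schedule degenerates to the current fixed-chunk behavior, which matches the `ic_0` baseline in the tables below.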

I will add some audio samples later.

Test Plan

Test Result

  • Benchmark: sent 10 prompts to the model.
  • Tested initial_codec_chunk_frames = 0, 2, 5, 10, 15, 20; 0 means no initial_codec_chunk_frames (current behavior).
  • Ran the e2e script with a timer between chunks and printed the gaps.
  • "ic" denotes the value set for initial_codec_chunk_frames.
  • The first request always takes longer, probably due to compilation.
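The timer between chunks can be sketched like this. `measure_stream` is a hypothetical harness (the actual e2e script is not shown in this PR); only the timing logic is the point: TTFA is the gap from request start to the first chunk, and inter-chunk times are the gaps between subsequent chunks.

```python
import time

def measure_stream(chunk_iter):
    """Consume a streaming audio-chunk iterator and return
    (ttfa_ms, inter_chunk_gaps_ms, total_ms)."""
    t0 = time.perf_counter()
    ttfa_ms = None
    gaps_ms = []
    prev = t0
    for _chunk in chunk_iter:
        now = time.perf_counter()
        if ttfa_ms is None:
            ttfa_ms = (now - t0) * 1000.0  # time to first audio
        else:
            gaps_ms.append((now - prev) * 1000.0)  # inter-chunk time
        prev = now
    total_ms = (time.perf_counter() - t0) * 1000.0
    return ttfa_ms, gaps_ms, total_ms
```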

TTFA (Time To First Audio) in milliseconds

| Config | Req 1 | Req 2 | Req 3 | Req 4 | Req 5 | Req 6 | Req 7 | Req 8 | Req 9 | Avg | Min | Max |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----|-----|-----|
| ic_0   | 1663  | 1644  | 1638  | 1639  | 1675  | 1686  | 1610  | 1627  | 1646  | 1648 | 1610 | 1686 |
| ic_2   | 190   | 171   | 160   | 157   | 161   | 203   | 157   | 170   | 166   | 170  | 157  | 203  |
| ic_5   | 342   | 357   | 334   | 353   | 386   | 342   | 349   | 376   | 364   | 356  | 334  | 386  |
| ic_10  | 665   | 674   | 648   | 727   | 701   | 696   | 705   | 708   | 659   | 687  | 648  | 727  |
| ic_15  | 1144  | 1065  | 990   | 1472  | 979   | 957   | 954   | 1009  | 1019  | 1065 | 954  | 1472 |
| ic_20  | 1393  | 1680  | 1530  | 1744  | 1762  | 1460  | 1634  | 1480  | 1552  | 1582 | 1393 | 1762 |

Total Generation Time (ms)

| Config | Req 1 | Req 2 | Req 3 | Req 4 | Req 5 | Req 6 | Req 7 | Req 8 | Req 9 | Avg |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----|
| ic_0   | 4175  | 4785  | 4364  | 4752  | 4444  | 4769  | 4589  | 6245  | 4925  | 4783 |
| ic_2   | 4180  | 4754  | 4239  | 4556  | 4411  | 4901  | 4565  | 6127  | 4894  | 4736 |
| ic_5   | 4059  | 4788  | 4223  | 4871  | 4557  | 4514  | 4691  | 6212  | 5184  | 4789 |
| ic_10  | 4057  | 4733  | 4299  | 4918  | 4392  | 4674  | 4714  | 6241  | 5117  | 4794 |
| ic_15  | 4463  | 4906  | 4438  | 5251  | 4303  | 4719  | 4562  | 6196  | 4996  | 4870 |
| ic_20  | 4932  | 5751  | 5100  | 6118  | 5486  | 5551  | 5702  | 7412  | 5917  | 5774 |

Inter-Chunk Time (ms)

| Config | Avg | Min | Max | Std | Chunks |
|--------|-----|-----|-----|-----|--------|
| ic_0   | 1593 | 1523 | 1663 | 41  | 11  |
| ic_2   | 273  | 116  | 1614 | 430 | 110 |
| ic_5   | 621  | 296  | 1645 | 549 | 47  |
| ic_10  | 1258 | 629  | 1679 | 469 | 25  |
| ic_15  | 1598 | 1542 | 1699 | 46  | 18  |
| ic_20  | 1929 | 1788 | 2216 | 129 | 16  |

output_1_ic_0.wav
output_1_ic_2.wav
output_1_ic_5.wav
output_1_ic_10.wav
output_1_ic_15.wav
output_1_ic_20.wav

Note that ic_15 and ic_20 yield similar results.

EDIT: removed the first request (in the tables) due to compilation overhead.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@JuanPZuluaga JuanPZuluaga changed the title [Feat][Qwen3TTS] increase TTFA by reduced initial_codec_frames at decoding time [Feat][Qwen3TTS] reduce TTFA with flexible warmup phase Mar 1, 2026
pablo added 2 commits March 1, 2026 19:30
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>

amy-why-3459 commented Mar 2, 2026

Can we reduce TTFP by adjusting chunk_size?


JuanPZuluaga commented Mar 2, 2026

Can we reduce TTFP by adjusting chunk_size?

Indeed, but then:

  • A lower chunk size means more calls to the code2wav model. If we go from 25 (the current default) to 5, we end up with 5x more calls.
  • We could increase the left context to compensate for the smaller chunk size, but the pain point would remain: TTFA stays high unless we reduce the chunk size to very low values.

With this approach, at least, we know that TTFA can go below 500 ms without compromising quality too much or adding much overhead. We could even increase the chunk size for better audio quality without increasing TTFA.
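The call-count trade-off above can be checked with a back-of-the-envelope sketch. `code2wav_calls` is a hypothetical helper, and the warmup accounting (small chunks covering the first full chunk, then standard chunks) is an assumption based on the PR description:

```python
import math

def code2wav_calls(total_frames: int, chunk_size: int,
                   warmup_chunk: int = 0) -> int:
    """Count code2wav invocations for one utterance."""
    if warmup_chunk <= 0:
        # Current behavior: fixed-size chunks only.
        return math.ceil(total_frames / chunk_size)
    # Warmup covers the first full chunk in small pieces (e.g. 25 // 5 = 5),
    # then the rest is decoded with the standard chunk size.
    warmup_calls = chunk_size // warmup_chunk
    remaining = total_frames - chunk_size
    return warmup_calls + max(0, math.ceil(remaining / chunk_size))

# For a 250-frame utterance:
#   chunk_size=25              -> 10 calls (baseline)
#   chunk_size=5 everywhere    -> 50 calls (5x more, as noted above)
#   warmup=5 then chunk_size=25 -> 14 calls (only a few extra up front)
```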

Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga JuanPZuluaga marked this pull request as ready for review March 2, 2026 08:33

JuanPZuluaga commented Mar 2, 2026

Please let me know what you think @amy-why-3459 @linyueqian. TTFA can go below ~300 ms.

Also, should I add some tests?

@amy-why-3459

Thank you so much for your contribution; it's a great idea. May I ask what your test scenario is? For example, concurrency and input/output lengths?

@JuanPZuluaga

Do you mean a real use case? If so, ideally we would like to reduce TTFA under high-concurrency loads for voice assistants, where very low latency to the first audio chunk is critical. In batched offline decoding scenarios, for instance, I wouldn't consider this value very important.


Sy0307 commented Mar 2, 2026

It makes sense to me to have initial_codec_chunk_frames in the warmup stage to reduce TTFA, and we have a similar scenario as well.

Additionally, I have a question: is it better to implement initial_codec_chunk_frames as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Happy to discuss.


JuanPZuluaga commented Mar 2, 2026

It makes sense to me that there is initial_codec_chunk_frames in the warm-up stage to reduce TTFA, and we need a similar scenario as well.

Additionally, I have a question: is it better to implement initial_codec_chunk_frames as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Welcome to discuss.

That's a good idea, actually, though it means adding yet more parameters to be set at the request level. Let me know what you think, and we can implement it.

pablo added 2 commits March 2, 2026 12:37
@amy-why-3459

I'd like to ask about the test scenarios in your Test Result. Also, could you please adapt this solution to qwen3-omni and mimo_audio as well?

fix
Signed-off-by: pablo <pablo@agigo.ai>
@hsliuustc0106

what are the definitions of TTFA/TTFP?


JuanPZuluaga commented Mar 2, 2026

what are the definitions of TTFA/TTFP?

Sorry about that, I've mixed the two, but they mean the same thing: Time To First Audio / Time To First Packet. So far, I hear glitches while doing "live streaming" because the model cannot keep up with real-time processing (i.e., RTFx is below 1). I expect it to be fast once we get the different models compiled, etc.
