[Feat][Qwen3TTS] reduce TTFA with flexible warmup phase#1583

Open
JuanPZuluaga wants to merge 9 commits intovllm-project:mainfrom
JuanPZuluaga:feat/qwen3tts-config-ttfp

Conversation


@JuanPZuluaga JuanPZuluaga commented Mar 1, 2026


Purpose

Related to #938.

In this PR, we lower the TTFA (Time To First Audio) of Qwen3TTS by requiring fewer frames before output streaming starts. This helps the whole system feel more responsive.

The current implementation is not very flexible: to reduce TTFA, one has to decode with very small chunks throughout, which can degrade audio quality.

  • We add initial_chunk_size, which dictates the chunk rate used to generate audio during a "warmup phase": initial_chunk_size + full left context.
  • Then, once enough frames have been collected, we move to the standard decoding phase: chunk_size + left_context_size.

PS: I decided not to add support for left_context_size=-1 (i.e., full left context), because very long sequences could cause OOM (the context grows too large); instead, the user can simply increase the left context to, say, 10 s.
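The two-phase schedule above can be sketched as follows. This is an illustrative Python sketch, not the actual vllm-omni implementation: the parameter names (`initial_codec_chunk_frames`, `chunk_size`, `left_context_size`) mirror the PR's options, while `plan_chunks` itself is a hypothetical helper.

```python
def plan_chunks(total_frames: int,
                chunk_size: int = 25,
                left_context_size: int = 25,
                initial_codec_chunk_frames: int = 0):
    """Yield (start, end, context_start) windows for code2wav decoding.

    Warmup phase: decode as soon as `initial_codec_chunk_frames` frames are
    available, with full left context (everything generated so far).
    Standard phase: once a full `chunk_size` worth of frames exists, fall
    back to fixed-size chunks with a bounded left context.
    """
    pos = 0
    # Warmup phase: small chunks until the first full chunk is covered.
    if initial_codec_chunk_frames > 0:
        while pos + initial_codec_chunk_frames <= min(chunk_size, total_frames):
            end = pos + initial_codec_chunk_frames
            yield (pos, end, 0)  # context_start=0 -> full left context
            pos = end
    # Standard phase: fixed chunk_size with bounded left context.
    while pos < total_frames:
        end = min(pos + chunk_size, total_frames)
        yield (pos, end, max(0, pos - left_context_size))
        pos = end

# Example: 60 frames, warmup chunks of 5 frames. The first audio chunk now
# ends at frame 5 instead of frame 25, so streaming can start much sooner.
schedule = list(plan_chunks(60, chunk_size=25, left_context_size=25,
                            initial_codec_chunk_frames=5))
```

With `initial_codec_chunk_frames=0` the schedule degenerates to the current fixed-chunk behavior, which matches the `ic_0` baseline in the tables below.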

I will add some audio samples later.

Test Plan

Test Result

  • Benchmark: sent 10 prompts to the model.
  • Tested initial_codec_chunk_frames = 0, 2, 5, 10, 15, 20; 0 means no initial_codec_chunk_frames (current behavior).
  • Ran the e2e script with a timer between chunks and printed the gaps.
  • "ic" denotes the value set for initial_codec_chunk_frames.
  • The first request always takes longer, probably due to compilation.
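The timer between chunks can be sketched like this. `measure_stream` is a hypothetical harness (the actual e2e script is not shown in this PR); only the timing logic is the point: TTFA is the gap from request start to the first chunk, and inter-chunk times are the gaps between subsequent chunks.

```python
import time

def measure_stream(chunk_iter):
    """Consume a streaming audio-chunk iterator and return
    (ttfa_ms, inter_chunk_gaps_ms, total_ms)."""
    t0 = time.perf_counter()
    ttfa_ms = None
    gaps_ms = []
    prev = t0
    for _chunk in chunk_iter:
        now = time.perf_counter()
        if ttfa_ms is None:
            ttfa_ms = (now - t0) * 1000.0  # time to first audio
        else:
            gaps_ms.append((now - prev) * 1000.0)  # inter-chunk time
        prev = now
    total_ms = (time.perf_counter() - t0) * 1000.0
    return ttfa_ms, gaps_ms, total_ms
```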

TTFA (Time To First Audio) in milliseconds

| Config | Req 1 | Req 2 | Req 3 | Req 4 | Req 5 | Req 6 | Req 7 | Req 8 | Req 9 | Avg | Min | Max |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----|-----|-----|
| ic_0   | 1663  | 1644  | 1638  | 1639  | 1675  | 1686  | 1610  | 1627  | 1646  | 1648 | 1610 | 1686 |
| ic_2   | 190   | 171   | 160   | 157   | 161   | 203   | 157   | 170   | 166   | 170  | 157  | 203  |
| ic_5   | 342   | 357   | 334   | 353   | 386   | 342   | 349   | 376   | 364   | 356  | 334  | 386  |
| ic_10  | 665   | 674   | 648   | 727   | 701   | 696   | 705   | 708   | 659   | 687  | 648  | 727  |
| ic_15  | 1144  | 1065  | 990   | 1472  | 979   | 957   | 954   | 1009  | 1019  | 1065 | 954  | 1472 |
| ic_20  | 1393  | 1680  | 1530  | 1744  | 1762  | 1460  | 1634  | 1480  | 1552  | 1582 | 1393 | 1762 |

Total Generation Time (ms)

| Config | Req 1 | Req 2 | Req 3 | Req 4 | Req 5 | Req 6 | Req 7 | Req 8 | Req 9 | Avg |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----|
| ic_0   | 4175  | 4785  | 4364  | 4752  | 4444  | 4769  | 4589  | 6245  | 4925  | 4783 |
| ic_2   | 4180  | 4754  | 4239  | 4556  | 4411  | 4901  | 4565  | 6127  | 4894  | 4736 |
| ic_5   | 4059  | 4788  | 4223  | 4871  | 4557  | 4514  | 4691  | 6212  | 5184  | 4789 |
| ic_10  | 4057  | 4733  | 4299  | 4918  | 4392  | 4674  | 4714  | 6241  | 5117  | 4794 |
| ic_15  | 4463  | 4906  | 4438  | 5251  | 4303  | 4719  | 4562  | 6196  | 4996  | 4870 |
| ic_20  | 4932  | 5751  | 5100  | 6118  | 5486  | 5551  | 5702  | 7412  | 5917  | 5774 |

Inter-Chunk Time (ms)

| Config | Avg | Min | Max | Std | Chunks |
|--------|-----|-----|-----|-----|--------|
| ic_0   | 1593 | 1523 | 1663 | 41  | 11  |
| ic_2   | 273  | 116  | 1614 | 430 | 110 |
| ic_5   | 621  | 296  | 1645 | 549 | 47  |
| ic_10  | 1258 | 629  | 1679 | 469 | 25  |
| ic_15  | 1598 | 1542 | 1699 | 46  | 18  |
| ic_20  | 1929 | 1788 | 2216 | 129 | 16  |

output_1_ic_0.wav
output_1_ic_2.wav
output_1_ic_5.wav
output_1_ic_10.wav
output_1_ic_15.wav
output_1_ic_20.wav

Note that ic_15 and ic_20 yield similar results.

EDIT: removed the first request (in the tables) due to compilation overhead.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@JuanPZuluaga JuanPZuluaga changed the title [Feat][Qwen3TTS] increase TTFA by reduced initial_codec_frames at decoding time [Feat][Qwen3TTS] reduce TTFA with flexible warmup phase Mar 1, 2026
pablo added 2 commits March 1, 2026 19:30
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>

amy-why-3459 commented Mar 2, 2026

Can we reduce TTFP by adjusting chunk_size?


JuanPZuluaga commented Mar 2, 2026

Can we reduce TTFP by adjusting chunk_size?

Indeed, but then:

  • A lower chunk size means more calls to the code2wav model. If we go from 25 (the current default) to 5, we end up with 5x more calls.
  • We could increase the left context to compensate for the smaller chunk size, but the pain point would remain: TTFA stays high unless we reduce the chunk size to very low values.

With this approach, at least, we know that TTFA can go below 500 ms without compromising quality too much or adding much overhead. We could even increase the chunk size for better audio quality without increasing TTFA.
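The call-count trade-off above can be checked with a back-of-the-envelope sketch. `code2wav_calls` is a hypothetical helper, and the warmup accounting (small chunks covering the first full chunk, then standard chunks) is an assumption based on the PR description:

```python
import math

def code2wav_calls(total_frames: int, chunk_size: int,
                   warmup_chunk: int = 0) -> int:
    """Count code2wav invocations for one utterance."""
    if warmup_chunk <= 0:
        # Current behavior: fixed-size chunks only.
        return math.ceil(total_frames / chunk_size)
    # Warmup covers the first full chunk in small pieces (e.g. 25 // 5 = 5),
    # then the rest is decoded with the standard chunk size.
    warmup_calls = chunk_size // warmup_chunk
    remaining = total_frames - chunk_size
    return warmup_calls + max(0, math.ceil(remaining / chunk_size))

# For a 250-frame utterance:
#   chunk_size=25              -> 10 calls (baseline)
#   chunk_size=5 everywhere    -> 50 calls (5x more, as noted above)
#   warmup=5 then chunk_size=25 -> 14 calls (only a few extra up front)
```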

Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga JuanPZuluaga marked this pull request as ready for review March 2, 2026 08:33

JuanPZuluaga commented Mar 2, 2026

Please let me know what you think @amy-why-3459 @linyueqian. TTFA can go below ~300 ms.

Also, should I add some tests?

@amy-why-3459

Thank you so much for your contribution; it's a great idea. May I ask what your test scenario is? For example, concurrency and input/output lengths?

@JuanPZuluaga

Do you mean a real use case? If so, ideally we would like to reduce TTFA under high-concurrency loads for voice assistants, where very low latency to the first audio chunk is critical. In batched offline decoding scenarios, for instance, I wouldn't consider this value very important.


Sy0307 commented Mar 2, 2026

It makes sense to me to have initial_codec_chunk_frames in the warmup stage to reduce TTFA, and we have a similar scenario as well.

Additionally, I have a question: is it better to implement initial_codec_chunk_frames as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Happy to discuss.


JuanPZuluaga commented Mar 2, 2026

It makes sense to me that there is initial_codec_chunk_frames in the warm-up stage to reduce TTFA, and we need a similar scenario as well.

Additionally, I have a question: is it better to implement initial_codec_chunk_frames as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Welcome to discuss.

That's a good idea, actually, though it means adding yet more parameters to be set at the request level. Let me know what you think, and we can implement it.

pablo added 2 commits March 2, 2026 12:37
@amy-why-3459

I'd like to ask about the test scenarios in your Test Result. Also, could you please adapt this solution to qwen3-omni and mimo_audio as well?

fix
Signed-off-by: pablo <pablo@agigo.ai>
@hsliuustc0106

what are the definitions of TTFA/TTFP?


JuanPZuluaga commented Mar 2, 2026

what are the definitions of TTFA/TTFP?

Sorry about that, I've mixed the two, but they mean the same thing: Time To First Audio / Time To First Packet. So far, I hear glitches while doing "live streaming" because the model cannot keep up with real-time processing (i.e., RTFx is below 1). I expect it to be fast once we get the different models compiled, etc.
