
Make chunk_size and left_context_size configurable via YAML for async chunking#1423

Open
LJH-LBJ wants to merge 33 commits into vllm-project:main from LJH-LBJ:Supports-configurable-chunk_size-and-left_context_size

Conversation

LJH-LBJ (Contributor) commented Feb 21, 2026

The new async chunking parameters exposed via YAML:

chunk_size: 4
left_context_size: 25

qwen3_omni_moe_async_chunk.yaml

```yaml
async_chunk: true
stage_args:
  - stage_id: 0
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "0"
      max_batch_size: 64
    engine_args:
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: false
      trust_remote_code: true
      engine_output_type: latent  # Output hidden states for talker
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk
      final_output: true
      final_output_type: text
      is_comprehension: true
      default_sampling_params:
        temperature: 0.4
        top_p: 0.9
        top_k: 1
        max_tokens: 2048
        seed: 42
        detokenize: True
        repetition_penalty: 1.05

  - stage_id: 1
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "1"
      max_batch_size: 64
    engine_args:
      model_stage: talker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.6
      enforce_eager: false
      trust_remote_code: true
      engine_output_type: latent  # Output codec codes for code2wav
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      distributed_executor_backend: "mp"
      hf_config_name: talker_config
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
      async_chunk_config:
        chunk_size: 4          # code2wav decode chunk size
        left_context_size: 25  # code2wav left context size
      engine_input_source: [0]
      default_sampling_params:
        temperature: 0.9
        top_k: 50
        max_tokens: 4096
        seed: 42
        detokenize: False
        repetition_penalty: 1.05
        stop_token_ids: [2150]

  - stage_id: 2
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "1"
      max_batch_size: 64
    engine_args:
      model_stage: code2wav
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio  # Final output: audio waveform
      gpu_memory_utilization: 0.1
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 51200  # [TODO] if max_num_batched_tokens < max_batch_size * 800, there will be a precision problem.
      hf_config_name: thinker_config
      engine_input_source: [1]
      final_output: true
      final_output_type: audio
      default_sampling_params:
        temperature: 0.0
        top_p: 1.0
        top_k: -1
        max_tokens: 65536
        seed: 42
        detokenize: True
        repetition_penalty: 1.1
```

Purpose

Resolves: #1239

This PR enables flexible configuration of chunk_size and left_context_size for the Code2Wav pipeline by exposing them in the async_chunk_config section of the stage YAML file. Previously these values were hardcoded to 25; now users can adjust them per stage directly in the YAML, allowing easier tuning and experimentation. Values from the YAML override the dataclass defaults, improving usability and modularity for multi-stage pipelines. No functional logic is altered; only configuration handling is updated.

(Figure: "Drawing 3" — pipeline diagram, image omitted)

When chunk_size is not equal to left_context_size, the table below demonstrates how left_context_size and code_predictor_codes are calculated (chunk_size_config = 4, left_context_size_config = 25):

| n | length | context_length | left_context_size | end_index | code_predictor_codes |
|---|--------|----------------|-------------------|-----------|----------------------|
| 1 | 4 | 4 | 0 | 4 | [0]+code_prompt_token_ids[-4:] |
| 2 | 8 | 4 | 4 | 8 | [4]+code_prompt_token_ids[-8:] |
| 3 | 12 | 4 | 8 | 12 | [8]+code_prompt_token_ids[-12:] |
| 4 | 16 | 4 | 12 | 16 | [12]+code_prompt_token_ids[-16:] |
| 5 | 20 | 4 | 16 | 20 | [16]+code_prompt_token_ids[-20:] |
| 6 | 24 | 4 | 20 | 24 | [20]+code_prompt_token_ids[-24:] |
| 7 | 28 | 4 | 24 | 28 | [24]+code_prompt_token_ids[-28:] |
| 8 | 32 | 4 | 25 | 29 | [25]+code_prompt_token_ids[-29:] |
| 9 | 36 | 4 | 25 | 29 | [25]+code_prompt_token_ids[-29:] |
| 10 (finish) | 37 | 1 | 25 | 26 | [25]+code_prompt_token_ids[-26:] |
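The per-chunk arithmetic above can be reproduced with a small helper (a standalone sketch; the function and parameter names mirror the table's columns, not the project's API):

```python
def chunk_indices(length, chunk_size_config=4, left_context_size_config=25, finished=False):
    """Compute (left_context_size, end_index) for one async chunk.

    context_length is the size of the newly produced chunk: a full
    chunk_size_config normally, or the remainder on the final chunk.
    """
    chunk_length = length % chunk_size_config
    if finished and chunk_length != 0:
        context_length = chunk_length  # final partial chunk
    else:
        context_length = chunk_size_config
    left_context_size = min(length - context_length, left_context_size_config)
    end_index = left_context_size + context_length
    return left_context_size, end_index
```

For example, `chunk_indices(32)` reproduces row 8 (25, 29), and `chunk_indices(37, finished=True)` reproduces the final row (25, 26).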

Test Plan

qwen3-omni accuracy

vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 8014 --stage-configs-path ./vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml
python -m pytest -s -v tests/e2e/online_serving/test_qwen3_omni.py -m "advanced_model" --run-level "advanced_model"

```python
def test_text_to_text_001(omni_server, openai_client) -> None:
    messages = dummy_messages_from_mix_data(system_prompt=get_system_prompt(), content_text=get_prompt())

    request_config = {
        "model": omni_server.model,
        "messages": messages,
        "stream": False,
        "modalities": ["text", "audio"],
        "key_words": {"text": ["beijing"]},
    }
    openai_client.send_request(request_config, request_num=10)  # concurrency = 10
```

qwen3-omni benchmark

vllm bench serve \
  --omni \
  --dataset-name random \
  --port 50146 \
  --max-concurrency 10 \
  --model /workspace/models/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --num-prompts 100 \
  --random-input-len 100 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --random-output-len 100 \
  --extra_body '{"modalities": ["text", "audio"]}'

qwen3-tts

python3 examples/offline_inference/qwen3_tts/end2end.py --query-type CustomVoice --txt-prompts examples/offline_inference/qwen3_tts/benchmark_prompts.txt --batch-size 4 --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml --log-stats

Test Result

qwen3-omni accuracy

==================== 2 passed, 19 warnings in 533.13s (0:08:53) ====================

qwen3-omni benchmark

Mean AUDIO_TTFP (ms): 2211 -> 1175

main branch

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  184.16
Request throughput (req/s):              0.54
Peak concurrent requests:                13.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          17828.85
Median E2EL (ms):                        18270.84
P99 E2EL (ms):                           23407.54
================== Text Result ===================
Total input tokens:                      10000
Total generated tokens:                  10000
Output token throughput (tok/s):         54.30
Peak output token throughput (tok/s):    229.00
Peak concurrent requests:                13.00
Total Token throughput (tok/s):          108.60
---------------Time to First Token----------------
Mean TTFT (ms):                          255.57
Median TTFT (ms):                        104.66
P99 TTFT (ms):                           1562.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.65
Median TPOT (ms):                        16.71
P99 TPOT (ms):                           46.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.51
Median ITL (ms):                         14.89
P99 ITL (ms):                            82.74
================== Audio Result ==================
Total audio duration generated(s):       2645.41
Total audio frames generated:            63490275
Audio throughput(audio duration/s):      14.37
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    2176.59
Median AUDIO_TTFP (ms):                  1682.14
P99 AUDIO_TTFP (ms):                     7423.52
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.67
Median AUDIO_RTF:                        0.67
P99 AUDIO_RTF:                           0.77
==================================================

opt branch chunk size = 25

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  185.21
Request throughput (req/s):              0.54
Peak concurrent requests:                13.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18157.44
Median E2EL (ms):                        18499.92
P99 E2EL (ms):                           24397.74
================== Text Result ===================
Total input tokens:                      10000
Total generated tokens:                  10000
Output token throughput (tok/s):         53.99
Peak output token throughput (tok/s):    240.00
Peak concurrent requests:                13.00
Total Token throughput (tok/s):          107.99
---------------Time to First Token----------------
Mean TTFT (ms):                          263.80
Median TTFT (ms):                        107.63
P99 TTFT (ms):                           1578.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.07
Median TPOT (ms):                        17.91
P99 TPOT (ms):                           45.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.91
Median ITL (ms):                         15.72
P99 ITL (ms):                            94.53
================== Audio Result ==================
Total audio duration generated(s):       2634.61
Total audio frames generated:            63231075
Audio throughput(audio duration/s):      14.23
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    2211.34
Median AUDIO_TTFP (ms):                  1700.58
P99 AUDIO_TTFP (ms):                     6845.09
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.69
Median AUDIO_RTF:                        0.67
P99 AUDIO_RTF:                           0.85
==================================================

opt branch chunk size = 4

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             10        
Benchmark duration (s):                  198.91    
Request throughput (req/s):              0.50      
Peak concurrent requests:                15.00     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          19226.26  
Median E2EL (ms):                        18974.63  
P99 E2EL (ms):                           25783.24  
================== Text Result ===================
Total input tokens:                      10000     
Total generated tokens:                  10000     
Output token throughput (tok/s):         50.27     
Peak output token throughput (tok/s):    229.00    
Peak concurrent requests:                15.00     
Total Token throughput (tok/s):          100.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          467.82    
Median TTFT (ms):                        230.00    
P99 TTFT (ms):                           2034.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.36     
Median TPOT (ms):                        17.98     
P99 TPOT (ms):                           52.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.22     
Median ITL (ms):                         16.68     
P99 ITL (ms):                            133.35    
================== Audio Result ==================
Total audio duration generated(s):       2507.92   
Total audio frames generated:            60190080  
Audio throughput(audio duration/s):      12.61     
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1175.26   
Median AUDIO_TTFP (ms):                  742.41    
P99 AUDIO_TTFP (ms):                     4445.73   
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.77      
Median AUDIO_RTF:                        0.72      
P99 AUDIO_RTF:                           1.04      
==================================================

qwen3-tts

INFO 03-02 14:25:14 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:14 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:14 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_wall_time_ms            | 19,054.799 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_total_tokens            |        353 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_avg_time_per_request_ms |  4,763.700 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_avg_tokens_per_s        |     18.526 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_stage_0_wall_time_ms    |  9,777.997 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_stage_1_wall_time_ms    |  9,277.590 |
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+

INFO 03-02 14:25:25 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:25 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:25 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_wall_time_ms            | 11,006.011 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_total_tokens            |        457 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_avg_time_per_request_ms |  2,751.503 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_avg_tokens_per_s        |     41.523 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_stage_0_wall_time_ms    |  9,477.145 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_stage_1_wall_time_ms    |  1,530.330 |
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+

INFO 03-02 14:25:36 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:36 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:36 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_wall_time_ms            | 11,544.613 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_total_tokens            |        482 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_avg_time_per_request_ms |  2,886.153 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_avg_tokens_per_s        |     41.751 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_stage_0_wall_time_ms    | 10,181.359 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_stage_1_wall_time_ms    |  1,365.293 |
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
LJH-LBJ and others added 2 commits February 21, 2026 22:46
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9fe559c2fa


hf_config_name: str | None = None
custom_process_next_stage_input_func: str | None = None
stage_connector_spec: dict[str, Any] = field(default_factory=dict)
async_chunk: bool = False
Contributor

Should we remove the async_chunk field here? It seems inconsistent with model.py.

Contributor Author

I will keep async_chunk as before and take it out of async_chunk_config.

@lishunyang12 (Contributor) left a comment

Thanks for the contribution, I have a few comments.

"async_chunk": False,
"chunk_size": 25,
"left_context_size": 25,
}
Contributor

The old async_chunk: bool = False field was replaced by this dict, but create_model_config on line 351 still does async_chunk=self.async_chunk. Since async_chunk is no longer a standalone field on AsyncOmniEngineArgs, this will raise AttributeError at runtime. (gcanlin flagged something similar.) It seems like async_chunk should either stay as a separate field or be derived here, e.g. self.async_chunk_config.get("async_chunk", False).
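One way the suggestion could look (a deliberately simplified sketch of the dataclass; the real AsyncOmniEngineArgs has many more fields):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AsyncOmniEngineArgs:
    # Simplified sketch: only the field relevant to the discussion.
    async_chunk_config: dict[str, Any] = field(
        default_factory=lambda: {
            "async_chunk": False,
            "chunk_size": 25,
            "left_context_size": 25,
        }
    )

    @property
    def async_chunk(self) -> bool:
        # Derived flag, so call sites like create_model_config can keep
        # reading self.async_chunk without raising AttributeError.
        return self.async_chunk_config.get("async_chunk", False)
```

(The thread below ultimately resolved this differently, by keeping async_chunk as its own field.)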

stage_connector_spec: dict[str, Any] = field(default_factory=dict)
async_chunk: bool = False
async_chunk_config: dict[str, Any] = field(
default_factory=lambda: {
Contributor

Nesting async_chunk (a boolean) inside async_chunk_config (a dict) alongside chunk_size/left_context_size (integers) feels like mixing concerns. OmniModelConfig keeps async_chunk: bool and async_chunk_config: dict as separate fields. Would it be cleaner to keep async_chunk as its own field here too and have async_chunk_config only hold chunk_size and left_context_size?

Contributor Author

Sure, I will keep async_chunk as before.

Contributor

Sounds good.

else:
logger.warning("No additional_information provided to code2wav stage.")
audio_tensors = self.generate_audio(codes, voice_type, left_context_size=left_context_size)

Contributor

This warning fires for every non-async-chunk call (or whenever additional_information is None), which could be very noisy in production. Is this intentional for debugging only, or should it be logger.debug instead?

Contributor Author

Sure, I will use logger.debug instead

Contributor

Thanks!

left_context_size = async_chunk_config.get("left_context_size", 25)
logger.warning(
"Left context size for async chunking is not provided, falling back to config default: %s",
left_context_size,
Contributor

Same concern here -- this warning fires every time the fallback path is taken, which is the normal path when additional_information does not include left_context_size. In a streaming scenario this could log on every chunk. Would logger.debug or a one-time warning be more appropriate?
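A one-time warning could be sketched like this (hypothetical helper; the names are illustrative, not the project's code):

```python
import logging

logger = logging.getLogger(__name__)
_warned_fallback = False  # module-level flag so the warning fires at most once


def get_left_context_size(additional_information, async_chunk_config):
    """Read left_context_size, warning only the first time the fallback is used."""
    global _warned_fallback
    value = (additional_information or {}).get("left_context_size")
    if value is None:
        value = async_chunk_config.get("left_context_size", 25)
        if not _warned_fallback:
            logger.warning("left_context_size not provided; falling back to %s", value)
            _warned_fallback = True
    return value
```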

wav_chunk = batch_wav[idx, :, context_size * self.total_upsample : code_seq_len * self.total_upsample]
# Remove context from output (left_context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]
wavs.append(wav_chunk)
Contributor

chunk_size is declared as a parameter of chunked_decode_streaming but is no longer used anywhere in the method body after this change. Is that intentional? If it is only there for API compatibility, might be worth a comment or removing it entirely.

Contributor Author

OK, I will remove chunk_size entirely.

transfer_manager.code_prompt_token_ids[request_id].append(codec_codes)
length = len(transfer_manager.code_prompt_token_ids[request_id])
chunk_length = length % chunk_size
chunk_length = length % chunk_size_config
Contributor

What happens when length == context_length (i.e., the first chunk)? min(0, left_context_size_config) gives 0, so end_index = 0 + chunk_size_config. That looks correct. But for the very last chunk when chunk_length != 0, context_length = chunk_length which could be small. Then left_context_size = min(length - chunk_length, left_context_size_config). Have you verified this against the table in the PR description for the final chunk case?

@LJH-LBJ (Contributor Author) commented Feb 22, 2026

When the first chunk has length == context_length, end_index = min(context_length, left_context_size + context_length) = min(4, 0 + 4), i.e. context_length = 4.

When the last chunk is smaller than chunk_size_config, left_context_size = min(length - chunk_length, left_context_size_config) = min(37 - 1, 25) = 25, as in the table in the PR description.

I think it is correct now.

Contributor

Ah got it, that tracks. Thanks for walking through it.

# Pass additional fields (like left_context_size) to the request
req.additional_information = {
k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
}
Contributor

This dict comprehension filters out code_predictor_codes and finished from payload_data. If new keys are added to the payload in the future, they will silently flow into additional_information. Would an explicit allowlist (e.g. only picking left_context_size) be safer here?

Contributor Author

OK, I will just pick left_context_size
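Sketched, the explicit allowlist might look like this (hypothetical constant and helper names):

```python
from typing import Any

# Only keys explicitly listed here flow into additional_information;
# new payload keys no longer leak through silently.
ALLOWED_ADDITIONAL_INFO_KEYS = ("left_context_size",)


def extract_additional_information(payload_data: dict[str, Any]) -> dict[str, Any]:
    """Pick allowlisted keys instead of filtering by exclusion."""
    return {k: payload_data[k] for k in ALLOWED_ADDITIONAL_INFO_KEYS if k in payload_data}
```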

distributed_executor_backend: "mp"
hf_config_name: talker_config
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
async_chunk_config:
Contributor

This config is under stage 1 (talker), which is where chunks are produced. But generate_audio in the code2wav stage also reads async_chunk_config from its own model_config (as a fallback when left_context_size is None). Since stage 2 does not have async_chunk_config in its YAML, it will silently use the default (25). Should async_chunk_config also be set under stage 2, or should the fallback in generate_audio be removed to make the additional_information path the only source of truth?

@LJH-LBJ (Contributor Author) commented Feb 22, 2026

Yes, chunk_size is no longer needed here. I will remove the fallback in generate_audio and only use the left_context_size from additional_information, so we needn't set async_chunk_config in stage 2.

Contributor

Sounds good.

@hsliuustc0106 (Collaborator)

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: Make chunk_size and left_context_size configurable via YAML for async chunking

1. Overview

This PR makes chunk_size and left_context_size configurable via YAML for the async chunking feature in the Code2Wav pipeline. The changes introduce a new async_chunk_config dictionary field that flows through the configuration system to the actual audio generation code.

Overall Assessment: The PR has a clear purpose and the implementation follows a reasonable pattern. However, there are several concerns around edge case handling, missing validation, and incomplete test documentation that should be addressed before merging.


2. Code Quality

Positive Aspects

  • Good documentation in the config dataclass
  • Clean separation of configuration from hardcoded values
  • Reasonable default values maintained for backward compatibility

Issues and Concerns

Critical: Logic change in qwen3_omni_code2wav.py

# OLD CODE (removed):
if code_seq_len <= chunk_size:
    context_size = 0
else:
    context_size = left_context_size

# NEW CODE:
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]

This removes the safety check for short sequences. If left_context_size > code_seq_len, the slice would be invalid (start > end). While the PR description shows that left_context_size is calculated as min(length - context_length, left_context_size_config) upstream, this defensive check should be preserved or explicitly documented as no longer needed.

Potential negative left_context_size:

left_context_size = min(length - context_length, left_context_size_config)

If length < context_length (edge case), this produces a negative value. Add a max(0, ...) guard:

left_context_size = max(0, min(length - context_length, left_context_size_config))

Overwriting additional_information:

req.additional_information = {
    k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
}

This completely replaces any existing additional_information. Consider merging instead:

if req.additional_information is None:
    req.additional_information = {}
req.additional_information.update({
    k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
})

Missing input validation:

async_chunk_config: dict[str, Any] = field(
    default_factory=lambda: {
        "chunk_size": 25,
        "left_context_size": 25,
    }
)

No validation that chunk_size and left_context_size are positive integers. Consider adding a __post_init__ validator:

def __post_init__(self):
    if self.async_chunk:
        chunk_size = self.async_chunk_config.get("chunk_size", 25)
        left_context_size = self.async_chunk_config.get("left_context_size", 25)
        if not isinstance(chunk_size, int) or chunk_size <= 0:
            raise ValueError(f"chunk_size must be a positive integer, got {chunk_size}")
        if not isinstance(left_context_size, int) or left_context_size <= 0:
            raise ValueError(f"left_context_size must be a positive integer, got {left_context_size}")

3. Architecture & Design

Positive Aspects

  • Configuration flows cleanly from YAML → EngineArgs → ModelConfig → Runtime
  • Good use of additional_information to pass runtime parameters

Concerns

Warning message may be too verbose:

if left_context_size is None:
    left_context_size = async_chunk_config.get("left_context_size", 25)
    logger.warning(
        "Left context size for async chunking is not provided, falling back to config default: %s",
        left_context_size,
    )

This warning will fire for non-async-chunk code paths. Consider checking if async_chunk is enabled first, or using logger.debug() instead.

Type hints could be more specific:

async_chunk_config: dict[str, Any]

Consider using a TypedDict or dataclass for better type safety:

from typing import TypedDict

class AsyncChunkConfig(TypedDict, total=False):
    chunk_size: int
    left_context_size: int
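
As a quick standalone sketch (not taken from the PR) of why `total=False` fits here: both keys stay optional at runtime, so the existing `.get(..., 25)` fallbacks keep working while type checkers can still flag unknown keys or wrong value types.

```python
from typing import TypedDict

class AsyncChunkConfig(TypedDict, total=False):
    chunk_size: int
    left_context_size: int

# Only chunk_size supplied; left_context_size falls back to the default via .get()
cfg: AsyncChunkConfig = {"chunk_size": 4}
left_context_size = cfg.get("left_context_size", 25)
```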

4. Security & Safety

  • No obvious security vulnerabilities
  • Resource management appears sound (existing lock mechanisms preserved)
  • No user input directly affects these configuration values (internal YAML config)

5. Testing & Documentation

Test Plan Issues

The PR states "Wait" for the test plan, which is incomplete. The checklist items are not filled in.

Required test coverage:

  1. Test with chunk_size != left_context_size
  2. Test with various chunk sizes (small, large)
  3. Test edge cases: very short audio sequences
  4. Test that YAML configuration is properly parsed
  5. Test backward compatibility (no config specified = defaults to 25)

Documentation

  • The docstring in OmniModelConfig is good
  • The YAML example in qwen3_omni_moe_async_chunk.yaml is helpful
  • Consider adding documentation about the relationship between chunk_size and left_context_size

6. Specific Suggestions

vllm_omni/model_executor/stage_input_processors/qwen3_omni.py:261

Add guard for negative values:

left_context_size = max(0, min(length - context_length, left_context_size_config))
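
To see why the guard matters: with a toy sequence shorter than the accumulated context, the unguarded expression yields a negative value, which would then be used as a negative slice offset downstream.

```python
# Toy values: the request's sequence is shorter than the accumulated context
length, context_length, left_context_size_config = 10, 30, 25

# Without the guard, min() propagates the negative difference
unguarded = min(length - context_length, left_context_size_config)

# With the guard, the value is clamped to a valid non-negative size
guarded = max(0, min(length - context_length, left_context_size_config))
```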

vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_code2wav.py:246-248

Add defensive check for slice bounds:

start_sample = left_context_size * self.total_upsample
end_sample = code_seq_len * self.total_upsample
if start_sample >= end_sample:
    start_sample = 0  # Fallback for edge cases
wav_chunk = batch_wav[idx, :, start_sample:end_sample]

vllm_omni/config/model.py:52-57

Add validation in __post_init__ for config values.

vllm_omni/distributed/omni_connectors/transfer_adapter/chunk_transfer_adapter.py:193-195

Consider merging instead of overwriting additional_information.

vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py:507-511

Change to logger.debug() or add condition check:

if left_context_size is None:
    left_context_size = async_chunk_config.get("left_context_size", 25)
    if self.vllm_config.model_config.async_chunk:
        logger.debug("Using default left_context_size: %s", left_context_size)

7. Approval Status

Changes Requested

The PR has merit but requires the following before merging:

  1. Must fix: Add guard for potential negative left_context_size calculation
  2. Must fix: Add defensive check in qwen3_omni_code2wav.py for slice bounds
  3. Must fix: Complete the test plan with actual test commands and results
  4. Should fix: Add input validation for config values
  5. Should fix: Consider merging additional_information instead of overwriting
  6. Should fix: Adjust the warning log level or add condition check

The core functionality appears sound, but the edge case handling needs strengthening. Once these issues are addressed, the PR should be ready for merge.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

LJH-LBJ and others added 3 commits February 22, 2026 23:48
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@lishunyang12 (Contributor):

Thanks for your fixes. I will take a look now. It is closer to ready state.

@lishunyang12 (Contributor) left a comment:

LGTM. Previous concerns resolved — async_chunk kept separate, allowlist for payload data, chunk_size param removed, logger.debug for the non-async path. Looks good. @princepride PTAL

@lishunyang12 (Contributor):

Nice work!

# Remove context from output (context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, context_size * self.total_upsample : code_seq_len * self.total_upsample]
# Remove context from output (left_context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]
Contributor:

In concurrent scenarios, the left_context_size of each request in code2wav may not be equal, so it is necessary to obtain the left_context_size of each request separately.

Contributor Author:

Done

@amy-why-3459 (Contributor):

Please provide the accuracy results under high concurrency scenarios, and also provide the performance test results when chunk_size=4 and chunk_size=25.

@hsliuustc0106 (Collaborator):

@vllm-omni-reviewer

LJH-LBJ and others added 3 commits February 24, 2026 17:22
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
LJH-LBJ and others added 4 commits February 26, 2026 12:17
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…ntext_size

Signed-off-by: Junhong Liu <ljh_lbj@163.com>
distributed_executor_backend: "mp"
hf_config_name: talker_config
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
async_chunk_config:
Contributor:

The configurations for chunk_size and left_context_size should be consistent with those for TTS.

Contributor Author:

Done

transfer_manager: OmniChunkTransferAdapter,
pooling_output: dict[str, Any],
request: OmniEngineCoreRequest,
**kwargs,
Contributor:

The chunk_size configuration should be placed within the transfer_manager; do not pass it to the talker2code2wav_async_chunk function. Obtain the parameter from the transfer_manager.

Contributor Author:

Done

LJH-LBJ and others added 6 commits February 27, 2026 11:33
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…ntext_size

Signed-off-by: Junhong Liu <ljh_lbj@163.com>
@lishunyang12 (Contributor) left a comment:

Left a couple comments. Most of the earlier feedback has been addressed, but there are a few remaining nits.

scheduled_spec_decode_tokens: dict[str, list[int]] = {}
scheduled_encoder_inputs: dict[str, list[int]] = {}
cached_prompt_token_ids: dict[str, list[int]] = {}
cached_additional_information: dict[str, list[int]] = {}
Contributor:

Wrong type hint -- this stores dict | None values, not list[int]. Should be dict[str, dict | None].

Contributor Author:

Done

"""

prompt_token_ids: dict[str, list[int]]
additional_information: dict[str, list[int]]
Contributor:

Same here -- should be dict[str, dict | None] to match what the scheduler actually puts in.

Contributor Author:

Done

"""
if not left_context_size:
logger.warning(
"left_context_size is None in chunked_decode_streaming;"
Contributor:

Nit: missing space after the semicolon causes the concatenated log parts to run together.
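
For reference, Python concatenates adjacent string literals at compile time with no separator, which is why the rendered log message runs together unless a space is included (a standalone reproduction):

```python
# Adjacent string literals are joined with no separator between them
message = (
    "left_context_size is None in chunked_decode_streaming;"
    " falling back to the configured default"  # leading space keeps the words apart
)
broken = (
    "left_context_size is None in chunked_decode_streaming;"
    "falling back to the configured default"  # no space: words run together
)
```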

Contributor Author:

Done

# cached requests. This is required for stages without preprocess
# (e.g., code2wav) so runtime_additional_information can be refreshed
# from scheduler cached infos on every step.
if hasattr(self.model, "has_preprocess") or hasattr(self.model, "enable_update_additional_information"):
Contributor:

Why check has_preprocess here when enable_update_additional_information is the intended gate? Models with has_preprocess=True but no enable_update_additional_information would hit this path unnecessarily.

Contributor Author:

In the original logic, update_additional_information can be called when has_preprocess=True. I don't want to change that behavior; I only want the code2wav stage to also be able to update additional_information.

Contributor:

Fair enough, makes sense.

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
… of https://github.com/LJH-LBJ/vllm-omni into Supports-configurable-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@amy-why-3459 (Contributor):

Please check if the changes to TTS meet expectations. @Sy0307 @linyueqian

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@linyueqian (Contributor):

Checked the TTS-side changes — overall LGTM. The move from magic input_ids[0] header to additional_information is a clean improvement and aligns TTS with Omni's approach.

Two minor things to flag:

  1. Missing max(0, ...) guard in Omni stage processor (qwen3_omni.py:261):

    left_context_size = min(length - context_length, left_context_size_config)

    Could go negative if length < context_length. TTS processor already has max(0, ...) — Omni should match.

  2. Index safety in TTS code2wav forward — the left_context_size list is built from runtime_additional_information but there's no guarantee it matches request_ids_list length. If fewer entries exist, some requests silently get left_context_size=0. Might want an explicit length check or warning.

Other than that, the enable_update_additional_information flag addition and gpu_model_runner gating look correct and necessary for code2wav (since has_preprocess=False).
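
Item 2 could be handled with an explicit alignment helper. This is only a sketch; the names `request_ids_list` and `left_context_sizes` mirror the discussion above and are assumptions, not the actual signatures:

```python
import logging

logger = logging.getLogger(__name__)

def align_left_context_sizes(request_ids_list, left_context_sizes, default=0):
    """Pad or truncate so every request gets an entry, warning on any mismatch."""
    if len(left_context_sizes) != len(request_ids_list):
        logger.warning(
            "left_context_sizes length %d does not match request count %d;"
            " padding missing entries with %d",
            len(left_context_sizes), len(request_ids_list), default,
        )
        left_context_sizes = (
            left_context_sizes[: len(request_ids_list)]
            + [default] * (len(request_ids_list) - len(left_context_sizes))
        )
    return left_context_sizes
```

This makes the silent `left_context_size=0` fallback explicit and logged, rather than implicit.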

LJH-LBJ and others added 2 commits February 28, 2026 12:05
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@LJH-LBJ (Contributor Author) commented Feb 28, 2026:

Replying to @linyueqian's two flagged items above:

  1. Done

  2. I check whether the lengths of left_context_size and seq_token_counts in chunked_decode_streaming are equal; if not, a warning is emitted.

@Sy0307 (Contributor) commented Feb 28, 2026:

LGTM. Please add test results for Qwen3-tts.

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@LJH-LBJ (Contributor Author) commented Mar 2, 2026:

LGTM. Please add test results for Qwen3-tts.

Done

Successfully merging this pull request may close these issues.

[RFC]: Supports configurable chunk_size and left_context_size
