
Make chunk_size and left_context_size configurable via YAML for async chunking#1423

Open
LJH-LBJ wants to merge 33 commits into vllm-project:main from LJH-LBJ:Supports-configurable-chunk_size-and-left_context_size

Conversation

LJH-LBJ (Contributor) commented Feb 21, 2026

The new async chunking parameters exposed via YAML:

chunk_size: 4
left_context_size: 25

qwen3_omni_moe_async_chunk.yaml

```yaml
async_chunk: true
stage_args:
  - stage_id: 0
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "0"
      max_batch_size: 64
    engine_args:
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: false
      trust_remote_code: true
      engine_output_type: latent  # Output hidden states for talker
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk
      final_output: true
      final_output_type: text
      is_comprehension: true
      default_sampling_params:
        temperature: 0.4
        top_p: 0.9
        top_k: 1
        max_tokens: 2048
        seed: 42
        detokenize: True
        repetition_penalty: 1.05

  - stage_id: 1
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "1"
      max_batch_size: 64
    engine_args:
      model_stage: talker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.6
      enforce_eager: false
      trust_remote_code: true
      engine_output_type: latent  # Output codec codes for code2wav
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      distributed_executor_backend: "mp"
      hf_config_name: talker_config
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
      async_chunk_config:
        chunk_size: 4          # code2wav decode chunk size
        left_context_size: 25  # code2wav left context size
      engine_input_source: [0]
      default_sampling_params:
        temperature: 0.9
        top_k: 50
        max_tokens: 4096
        seed: 42
        detokenize: False
        repetition_penalty: 1.05
        stop_token_ids: [2150]

  - stage_id: 2
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "1"
      max_batch_size: 64
    engine_args:
      model_stage: code2wav
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio  # Final output: audio waveform
      gpu_memory_utilization: 0.1
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 51200  # [TODO] if max_num_batched_tokens < max_batch_size * 800, there will be a precision problem.
      hf_config_name: thinker_config
      engine_input_source: [1]
      final_output: true
      final_output_type: audio
      default_sampling_params:
        temperature: 0.0
        top_p: 1.0
        top_k: -1
        max_tokens: 65536
        seed: 42
        detokenize: True
        repetition_penalty: 1.1
```

Purpose

Resolves: #1239

This PR enables flexible configuration of chunk_size and left_context_size for the Code2Wav pipeline by exposing them in the async_chunk_config section of the stage YAML file. Previously these values were hardcoded to 25; now users can adjust them per stage directly in the YAML, allowing easier tuning and experimentation. Values from the YAML override the dataclass defaults, improving usability and modularity for multi-stage pipelines. No functional logic is altered; only configuration handling is updated.

(Figure: "Drawing 3" — pipeline diagram, image omitted)

When chunk_size is not equal to left_context_size, the table below demonstrates how left_context_size and code_predictor_codes are calculated (chunk_size_config = 4, left_context_size_config = 25):

| n | length | context_length | left_context_size | end_index | code_predictor_codes |
|---|--------|----------------|-------------------|-----------|----------------------|
| 1 | 4 | 4 | 0 | 4 | [0]+code_prompt_token_ids[-4:] |
| 2 | 8 | 4 | 4 | 8 | [4]+code_prompt_token_ids[-8:] |
| 3 | 12 | 4 | 8 | 12 | [8]+code_prompt_token_ids[-12:] |
| 4 | 16 | 4 | 12 | 16 | [12]+code_prompt_token_ids[-16:] |
| 5 | 20 | 4 | 16 | 20 | [16]+code_prompt_token_ids[-20:] |
| 6 | 24 | 4 | 20 | 24 | [20]+code_prompt_token_ids[-24:] |
| 7 | 28 | 4 | 24 | 28 | [24]+code_prompt_token_ids[-28:] |
| 8 | 32 | 4 | 25 | 29 | [25]+code_prompt_token_ids[-29:] |
| 9 | 36 | 4 | 25 | 29 | [25]+code_prompt_token_ids[-29:] |
| 10 (finish) | 37 | 1 | 25 | 26 | [25]+code_prompt_token_ids[-26:] |
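The per-chunk arithmetic above can be reproduced with a small helper (a standalone sketch; the function and parameter names mirror the table's columns, not the project's API):

```python
def chunk_indices(length, chunk_size_config=4, left_context_size_config=25, finished=False):
    """Compute (left_context_size, end_index) for one async chunk.

    context_length is the size of the newly produced chunk: a full
    chunk_size_config normally, or the remainder on the final chunk.
    """
    chunk_length = length % chunk_size_config
    if finished and chunk_length != 0:
        context_length = chunk_length  # final partial chunk
    else:
        context_length = chunk_size_config
    left_context_size = min(length - context_length, left_context_size_config)
    end_index = left_context_size + context_length
    return left_context_size, end_index
```

For example, `chunk_indices(32)` reproduces row 8 (25, 29), and `chunk_indices(37, finished=True)` reproduces the final row (25, 26).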

Test Plan

qwen3-omni accuracy

vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 8014 --stage-configs-path ./vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml
python -m pytest -s -v tests/e2e/online_serving/test_qwen3_omni.py -m "advanced_model" --run-level "advanced_model"

```python
def test_text_to_text_001(omni_server, openai_client) -> None:
    messages = dummy_messages_from_mix_data(system_prompt=get_system_prompt(), content_text=get_prompt())

    request_config = {
        "model": omni_server.model,
        "messages": messages,
        "stream": False,
        "modalities": ["text", "audio"],
        "key_words": {"text": ["beijing"]},
    }
    openai_client.send_request(request_config, request_num=10)  # concurrency = 10
```

qwen3-omni benchmark

vllm bench serve \
  --omni \
  --dataset-name random \
  --port 50146 \
  --max-concurrency 10 \
  --model /workspace/models/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --num-prompts 100 \
  --random-input-len 100 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --random-output-len 100 \
  --extra_body '{"modalities": ["text", "audio"]}'

qwen3-tts

python3 examples/offline_inference/qwen3_tts/end2end.py --query-type CustomVoice --txt-prompts examples/offline_inference/qwen3_tts/benchmark_prompts.txt --batch-size 4 --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml --log-stats

Test Result

qwen3-omni accuracy

==================== 2 passed, 19 warnings in 533.13s (0:08:53) ====================

qwen3-omni benchmark

Mean AUDIO_TTFP (ms): 2211 -> 1175

main branch

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  184.16
Request throughput (req/s):              0.54
Peak concurrent requests:                13.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          17828.85
Median E2EL (ms):                        18270.84
P99 E2EL (ms):                           23407.54
================== Text Result ===================
Total input tokens:                      10000
Total generated tokens:                  10000
Output token throughput (tok/s):         54.30
Peak output token throughput (tok/s):    229.00
Peak concurrent requests:                13.00
Total Token throughput (tok/s):          108.60
---------------Time to First Token----------------
Mean TTFT (ms):                          255.57
Median TTFT (ms):                        104.66
P99 TTFT (ms):                           1562.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.65
Median TPOT (ms):                        16.71
P99 TPOT (ms):                           46.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.51
Median ITL (ms):                         14.89
P99 ITL (ms):                            82.74
================== Audio Result ==================
Total audio duration generated(s):       2645.41
Total audio frames generated:            63490275
Audio throughput(audio duration/s):      14.37
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    2176.59
Median AUDIO_TTFP (ms):                  1682.14
P99 AUDIO_TTFP (ms):                     7423.52
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.67
Median AUDIO_RTF:                        0.67
P99 AUDIO_RTF:                           0.77
==================================================

opt branch chunk size = 25

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  185.21
Request throughput (req/s):              0.54
Peak concurrent requests:                13.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18157.44
Median E2EL (ms):                        18499.92
P99 E2EL (ms):                           24397.74
================== Text Result ===================
Total input tokens:                      10000
Total generated tokens:                  10000
Output token throughput (tok/s):         53.99
Peak output token throughput (tok/s):    240.00
Peak concurrent requests:                13.00
Total Token throughput (tok/s):          107.99
---------------Time to First Token----------------
Mean TTFT (ms):                          263.80
Median TTFT (ms):                        107.63
P99 TTFT (ms):                           1578.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.07
Median TPOT (ms):                        17.91
P99 TPOT (ms):                           45.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.91
Median ITL (ms):                         15.72
P99 ITL (ms):                            94.53
================== Audio Result ==================
Total audio duration generated(s):       2634.61
Total audio frames generated:            63231075
Audio throughput(audio duration/s):      14.23
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    2211.34
Median AUDIO_TTFP (ms):                  1700.58
P99 AUDIO_TTFP (ms):                     6845.09
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.69
Median AUDIO_RTF:                        0.67
P99 AUDIO_RTF:                           0.85
==================================================

opt branch chunk size = 4

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             10        
Benchmark duration (s):                  198.91    
Request throughput (req/s):              0.50      
Peak concurrent requests:                15.00     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          19226.26  
Median E2EL (ms):                        18974.63  
P99 E2EL (ms):                           25783.24  
================== Text Result ===================
Total input tokens:                      10000     
Total generated tokens:                  10000     
Output token throughput (tok/s):         50.27     
Peak output token throughput (tok/s):    229.00    
Peak concurrent requests:                15.00     
Total Token throughput (tok/s):          100.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          467.82    
Median TTFT (ms):                        230.00    
P99 TTFT (ms):                           2034.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.36     
Median TPOT (ms):                        17.98     
P99 TPOT (ms):                           52.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.22     
Median ITL (ms):                         16.68     
P99 ITL (ms):                            133.35    
================== Audio Result ==================
Total audio duration generated(s):       2507.92   
Total audio frames generated:            60190080  
Audio throughput(audio duration/s):      12.61     
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1175.26   
Median AUDIO_TTFP (ms):                  742.41    
P99 AUDIO_TTFP (ms):                     4445.73   
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.77      
Median AUDIO_RTF:                        0.72      
P99 AUDIO_RTF:                           1.04      
==================================================

qwen3-tts

INFO 03-02 14:25:14 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:14 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:14 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_wall_time_ms            | 19,054.799 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_total_tokens            |        353 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_avg_time_per_request_ms |  4,763.700 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_avg_tokens_per_s        |     18.526 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_stage_0_wall_time_ms    |  9,777.997 |
INFO 03-02 14:25:14 [stats.py:502] | e2e_stage_1_wall_time_ms    |  9,277.590 |
INFO 03-02 14:25:14 [stats.py:502] +-----------------------------+------------+

INFO 03-02 14:25:25 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:25 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:25 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_wall_time_ms            | 11,006.011 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_total_tokens            |        457 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_avg_time_per_request_ms |  2,751.503 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_avg_tokens_per_s        |     41.523 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_stage_0_wall_time_ms    |  9,477.145 |
INFO 03-02 14:25:25 [stats.py:502] | e2e_stage_1_wall_time_ms    |  1,530.330 |
INFO 03-02 14:25:25 [stats.py:502] +-----------------------------+------------+

INFO 03-02 14:25:36 [stats.py:502] [Overall Summary]
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:36 [stats.py:502] | Field                       |      Value |
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+
INFO 03-02 14:25:36 [stats.py:502] | e2e_requests                |          4 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_wall_time_ms            | 11,544.613 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_total_tokens            |        482 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_avg_time_per_request_ms |  2,886.153 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_avg_tokens_per_s        |     41.751 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_stage_0_wall_time_ms    | 10,181.359 |
INFO 03-02 14:25:36 [stats.py:502] | e2e_stage_1_wall_time_ms    |  1,365.293 |
INFO 03-02 14:25:36 [stats.py:502] +-----------------------------+------------+

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
LJH-LBJ and others added 2 commits February 21, 2026 22:46
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9fe559c2fa


hf_config_name: str | None = None
custom_process_next_stage_input_func: str | None = None
stage_connector_spec: dict[str, Any] = field(default_factory=dict)
async_chunk: bool = False
Contributor

Should we remove the async_chunk field here? It seems inconsistent with model.py.

Contributor Author

I will keep async_chunk as before and take it out of async_chunk_config.

@lishunyang12 (Contributor) left a comment

Thanks for the contribution, I have a few comments.

"async_chunk": False,
"chunk_size": 25,
"left_context_size": 25,
}
Contributor

The old async_chunk: bool = False field was replaced by this dict, but create_model_config on line 351 still does async_chunk=self.async_chunk. Since async_chunk is no longer a standalone field on AsyncOmniEngineArgs, this will raise AttributeError at runtime. (gcanlin flagged something similar.) It seems like async_chunk should either stay as a separate field or be derived here, e.g. self.async_chunk_config.get("async_chunk", False).
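One way the suggestion could look (a deliberately simplified sketch of the dataclass; the real AsyncOmniEngineArgs has many more fields):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AsyncOmniEngineArgs:
    # Simplified sketch: only the field relevant to the discussion.
    async_chunk_config: dict[str, Any] = field(
        default_factory=lambda: {
            "async_chunk": False,
            "chunk_size": 25,
            "left_context_size": 25,
        }
    )

    @property
    def async_chunk(self) -> bool:
        # Derived flag, so call sites like create_model_config can keep
        # reading self.async_chunk without raising AttributeError.
        return self.async_chunk_config.get("async_chunk", False)
```

(The thread below ultimately resolved this differently, by keeping async_chunk as its own field.)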

stage_connector_spec: dict[str, Any] = field(default_factory=dict)
async_chunk: bool = False
async_chunk_config: dict[str, Any] = field(
default_factory=lambda: {
Contributor

Nesting async_chunk (a boolean) inside async_chunk_config (a dict) alongside chunk_size/left_context_size (integers) feels like mixing concerns. OmniModelConfig keeps async_chunk: bool and async_chunk_config: dict as separate fields. Would it be cleaner to keep async_chunk as its own field here too and have async_chunk_config only hold chunk_size and left_context_size?

Contributor Author

Sure, I will keep async_chunk as before.

Contributor

Sounds good.

else:
logger.warning("No additional_information provided to code2wav stage.")
audio_tensors = self.generate_audio(codes, voice_type, left_context_size=left_context_size)

Contributor

This warning fires for every non-async-chunk call (or whenever additional_information is None), which could be very noisy in production. Is this intentional for debugging only, or should it be logger.debug instead?

Contributor Author

Sure, I will use logger.debug instead

Contributor

Thanks!

left_context_size = async_chunk_config.get("left_context_size", 25)
logger.warning(
"Left context size for async chunking is not provided, falling back to config default: %s",
left_context_size,
Contributor

Same concern here -- this warning fires every time the fallback path is taken, which is the normal path when additional_information does not include left_context_size. In a streaming scenario this could log on every chunk. Would logger.debug or a one-time warning be more appropriate?
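A one-time warning could be sketched like this (hypothetical helper; the names are illustrative, not the project's code):

```python
import logging

logger = logging.getLogger(__name__)
_warned_fallback = False  # module-level flag so the warning fires at most once


def get_left_context_size(additional_information, async_chunk_config):
    """Read left_context_size, warning only the first time the fallback is used."""
    global _warned_fallback
    value = (additional_information or {}).get("left_context_size")
    if value is None:
        value = async_chunk_config.get("left_context_size", 25)
        if not _warned_fallback:
            logger.warning("left_context_size not provided; falling back to %s", value)
            _warned_fallback = True
    return value
```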

wav_chunk = batch_wav[idx, :, context_size * self.total_upsample : code_seq_len * self.total_upsample]
# Remove context from output (left_context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]
wavs.append(wav_chunk)
Contributor

chunk_size is declared as a parameter of chunked_decode_streaming but is no longer used anywhere in the method body after this change. Is that intentional? If it is only there for API compatibility, might be worth a comment or removing it entirely.

Contributor Author

OK, I will remove chunk_size entirely.

transfer_manager.code_prompt_token_ids[request_id].append(codec_codes)
length = len(transfer_manager.code_prompt_token_ids[request_id])
chunk_length = length % chunk_size
chunk_length = length % chunk_size_config
Contributor

What happens when length == context_length (i.e., the first chunk)? min(0, left_context_size_config) gives 0, so end_index = 0 + chunk_size_config. That looks correct. But for the very last chunk when chunk_length != 0, context_length = chunk_length which could be small. Then left_context_size = min(length - chunk_length, left_context_size_config). Have you verified this against the table in the PR description for the final chunk case?

@LJH-LBJ (Contributor Author) commented Feb 22, 2026

When the first chunk has length == context_length, end_index = min(context_length, left_context_size + context_length) = min(4, 0 + 4), i.e. context_length = 4.

When the last chunk is smaller than chunk_size_config, left_context_size = min(length - chunk_length, left_context_size_config) = min(37 - 1, 25) = 25, as in the table in the PR description.

I think it is correct now.

Contributor

Ah got it, that tracks. Thanks for walking through it.

# Pass additional fields (like left_context_size) to the request
req.additional_information = {
k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
}
Contributor

This dict comprehension filters out code_predictor_codes and finished from payload_data. If new keys are added to the payload in the future, they will silently flow into additional_information. Would an explicit allowlist (e.g. only picking left_context_size) be safer here?

Contributor Author

OK, I will just pick left_context_size
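Sketched, the explicit allowlist might look like this (hypothetical constant and helper names):

```python
from typing import Any

# Only keys explicitly listed here flow into additional_information;
# new payload keys no longer leak through silently.
ALLOWED_ADDITIONAL_INFO_KEYS = ("left_context_size",)


def extract_additional_information(payload_data: dict[str, Any]) -> dict[str, Any]:
    """Pick allowlisted keys instead of filtering by exclusion."""
    return {k: payload_data[k] for k in ALLOWED_ADDITIONAL_INFO_KEYS if k in payload_data}
```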

distributed_executor_backend: "mp"
hf_config_name: talker_config
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
async_chunk_config:
Contributor

This config is under stage 1 (talker), which is where chunks are produced. But generate_audio in the code2wav stage also reads async_chunk_config from its own model_config (as a fallback when left_context_size is None). Since stage 2 does not have async_chunk_config in its YAML, it will silently use the default (25). Should async_chunk_config also be set under stage 2, or should the fallback in generate_audio be removed to make the additional_information path the only source of truth?

@LJH-LBJ (Contributor Author) commented Feb 22, 2026

Yes, chunk_size is no longer needed here. I will remove the fallback in generate_audio and only use the left_context_size from additional_information, so we needn't set async_chunk_config in stage 2.

Contributor

Sounds good.

@hsliuustc0106 (Collaborator)

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: Make chunk_size and left_context_size configurable via YAML for async chunking

1. Overview

This PR makes chunk_size and left_context_size configurable via YAML for the async chunking feature in the Code2Wav pipeline. The changes introduce a new async_chunk_config dictionary field that flows through the configuration system to the actual audio generation code.

Overall Assessment: The PR has a clear purpose and the implementation follows a reasonable pattern. However, there are several concerns around edge case handling, missing validation, and incomplete test documentation that should be addressed before merging.


2. Code Quality

Positive Aspects

  • Good documentation in the config dataclass
  • Clean separation of configuration from hardcoded values
  • Reasonable default values maintained for backward compatibility

Issues and Concerns

Critical: Logic change in qwen3_omni_code2wav.py

# OLD CODE (removed):
if code_seq_len <= chunk_size:
    context_size = 0
else:
    context_size = left_context_size

# NEW CODE:
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]

This removes the safety check for short sequences. If left_context_size > code_seq_len, the slice would be invalid (start > end). While the PR description shows that left_context_size is calculated as min(length - context_length, left_context_size_config) upstream, this defensive check should be preserved or explicitly documented as no longer needed.

Potential negative left_context_size:

left_context_size = min(length - context_length, left_context_size_config)

If length < context_length (edge case), this produces a negative value. Add a max(0, ...) guard:

left_context_size = max(0, min(length - context_length, left_context_size_config))

Overwriting additional_information:

req.additional_information = {
    k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
}

This completely replaces any existing additional_information. Consider merging instead:

if req.additional_information is None:
    req.additional_information = {}
req.additional_information.update({
    k: v for k, v in payload_data.items() if k not in ("code_predictor_codes", "finished")
})

Missing input validation:

async_chunk_config: dict[str, Any] = field(
    default_factory=lambda: {
        "chunk_size": 25,
        "left_context_size": 25,
    }
)

No validation that chunk_size and left_context_size are positive integers. Consider adding a __post_init__ validator:

def __post_init__(self):
    if self.async_chunk:
        chunk_size = self.async_chunk_config.get("chunk_size", 25)
        left_context_size = self.async_chunk_config.get("left_context_size", 25)
        if not isinstance(chunk_size, int) or chunk_size <= 0:
            raise ValueError(f"chunk_size must be a positive integer, got {chunk_size}")
        if not isinstance(left_context_size, int) or left_context_size <= 0:
            raise ValueError(f"left_context_size must be a positive integer, got {left_context_size}")

3. Architecture & Design

Positive Aspects

  • Configuration flows cleanly from YAML → EngineArgs → ModelConfig → Runtime
  • Good use of additional_information to pass runtime parameters

Concerns

Warning message may be too verbose:

if left_context_size is None:
    left_context_size = async_chunk_config.get("left_context_size", 25)
    logger.warning(
        "Left context size for async chunking is not provided, falling back to config default: %s",
        left_context_size,
    )

This warning will fire for non-async-chunk code paths. Consider checking if async_chunk is enabled first, or using logger.debug() instead.

Type hints could be more specific:

async_chunk_config: dict[str, Any]

Consider using a TypedDict or dataclass for better type safety:

from typing import TypedDict

class AsyncChunkConfig(TypedDict, total=False):
    chunk_size: int
    left_context_size: int
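
As a quick standalone sketch (not taken from the PR) of why `total=False` fits here: both keys stay optional at runtime, so the existing `.get(..., 25)` fallbacks keep working while type checkers can still flag unknown keys or wrong value types.

```python
from typing import TypedDict

class AsyncChunkConfig(TypedDict, total=False):
    chunk_size: int
    left_context_size: int

# Only chunk_size supplied; left_context_size falls back to the default via .get()
cfg: AsyncChunkConfig = {"chunk_size": 4}
left_context_size = cfg.get("left_context_size", 25)
```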

4. Security & Safety

  • No obvious security vulnerabilities
  • Resource management appears sound (existing lock mechanisms preserved)
  • No user input directly affects these configuration values (internal YAML config)

5. Testing & Documentation

Test Plan Issues

The PR states "Wait" for the test plan, which is incomplete. The checklist items are not filled in.

Required test coverage:

  1. Test with chunk_size != left_context_size
  2. Test with various chunk sizes (small, large)
  3. Test edge cases: very short audio sequences
  4. Test that YAML configuration is properly parsed
  5. Test backward compatibility (no config specified = defaults to 25)

Documentation

  • The docstring in OmniModelConfig is good
  • The YAML example in qwen3_omni_moe_async_chunk.yaml is helpful
  • Consider adding documentation about the relationship between chunk_size and left_context_size

6. Specific Suggestions

vllm_omni/model_executor/stage_input_processors/qwen3_omni.py:261

Add guard for negative values:

left_context_size = max(0, min(length - context_length, left_context_size_config))
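
To see why the guard matters: with a toy sequence shorter than the accumulated context, the unguarded expression yields a negative value, which would then be used as a negative slice offset downstream.

```python
# Toy values: the request's sequence is shorter than the accumulated context
length, context_length, left_context_size_config = 10, 30, 25

# Without the guard, min() propagates the negative difference
unguarded = min(length - context_length, left_context_size_config)

# With the guard, the value is clamped to a valid non-negative size
guarded = max(0, min(length - context_length, left_context_size_config))
```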

vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_code2wav.py:246-248

Add defensive check for slice bounds:

start_sample = left_context_size * self.total_upsample
end_sample = code_seq_len * self.total_upsample
if start_sample >= end_sample:
    start_sample = 0  # Fallback for edge cases
wav_chunk = batch_wav[idx, :, start_sample:end_sample]

vllm_omni/config/model.py:52-57

Add validation in __post_init__ for config values.

vllm_omni/distributed/omni_connectors/transfer_adapter/chunk_transfer_adapter.py:193-195

Consider merging instead of overwriting additional_information.

vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py:507-511

Change to logger.debug() or add condition check:

if left_context_size is None:
    left_context_size = async_chunk_config.get("left_context_size", 25)
    if self.vllm_config.model_config.async_chunk:
        logger.debug("Using default left_context_size: %s", left_context_size)

7. Approval Status

Changes Requested

The PR has merit but requires the following before merging:

  1. Must fix: Add guard for potential negative left_context_size calculation
  2. Must fix: Add defensive check in qwen3_omni_code2wav.py for slice bounds
  3. Must fix: Complete the test plan with actual test commands and results
  4. Should fix: Add input validation for config values
  5. Should fix: Consider merging additional_information instead of overwriting
  6. Should fix: Adjust the warning log level or add condition check

The core functionality appears sound, but the edge case handling needs strengthening. Once these issues are addressed, the PR should be ready for merge.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

LJH-LBJ and others added 3 commits February 22, 2026 23:48
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@lishunyang12 (Contributor):

Thanks for your fixes. I will take a look now. It is closer to ready state.

@lishunyang12 (Contributor) left a comment:

LGTM. Previous concerns resolved — async_chunk kept separate, allowlist for payload data, chunk_size param removed, logger.debug for the non-async path. Looks good. @princepride PTAL

@lishunyang12 (Contributor):

Nice work!

# Remove context from output (context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, context_size * self.total_upsample : code_seq_len * self.total_upsample]
# Remove context from output (left_context_size * total_upsample samples)
wav_chunk = batch_wav[idx, :, left_context_size * self.total_upsample : code_seq_len * self.total_upsample]
Contributor:

In concurrent scenarios, the left_context_size of each request in code2wav may not be equal, so it is necessary to obtain the left_context_size of each request separately.

Contributor Author:

Done

@amy-why-3459 (Contributor):

Please provide the accuracy results under high concurrency scenarios, and also provide the performance test results when chunk_size=4 and chunk_size=25.

@hsliuustc0106 (Collaborator):

@vllm-omni-reviewer

LJH-LBJ and others added 3 commits February 24, 2026 17:22
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
LJH-LBJ and others added 4 commits February 26, 2026 12:17
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…ntext_size

Signed-off-by: Junhong Liu <ljh_lbj@163.com>
distributed_executor_backend: "mp"
hf_config_name: talker_config
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
async_chunk_config:
Contributor:

The configurations for chunk_size and left_context_size should be consistent with those for TTS.

Contributor Author:

Done

transfer_manager: OmniChunkTransferAdapter,
pooling_output: dict[str, Any],
request: OmniEngineCoreRequest,
**kwargs,
Contributor:

The chunk_size configuration should be placed within the transfer_manager; do not pass it to the talker2code2wav_async_chunk function. Obtain the parameter from the transfer_manager.

Contributor Author:

Done

LJH-LBJ and others added 6 commits February 27, 2026 11:33
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
…ntext_size

Signed-off-by: Junhong Liu <ljh_lbj@163.com>
@lishunyang12 (Contributor) left a comment:

Left a couple comments. Most of the earlier feedback has been addressed, but there are a few remaining nits.

scheduled_spec_decode_tokens: dict[str, list[int]] = {}
scheduled_encoder_inputs: dict[str, list[int]] = {}
cached_prompt_token_ids: dict[str, list[int]] = {}
cached_additional_information: dict[str, list[int]] = {}
Contributor:

Wrong type hint -- this stores dict | None values, not list[int]. Should be dict[str, dict | None].

Contributor Author:

Done

"""

prompt_token_ids: dict[str, list[int]]
additional_information: dict[str, list[int]]
Contributor:

Same here -- should be dict[str, dict | None] to match what the scheduler actually puts in.

Contributor Author:

Done

"""
if not left_context_size:
logger.warning(
"left_context_size is None in chunked_decode_streaming;"
Contributor:

Nit: missing space after the semicolon causes the concatenated log parts to run together.
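
For reference, Python concatenates adjacent string literals at compile time with no separator, which is why the rendered log message runs together unless a space is included (a standalone reproduction):

```python
# Adjacent string literals are joined with no separator between them
message = (
    "left_context_size is None in chunked_decode_streaming;"
    " falling back to the configured default"  # leading space keeps the words apart
)
broken = (
    "left_context_size is None in chunked_decode_streaming;"
    "falling back to the configured default"  # no space: words run together
)
```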

Contributor Author:

Done

# cached requests. This is required for stages without preprocess
# (e.g., code2wav) so runtime_additional_information can be refreshed
# from scheduler cached infos on every step.
if hasattr(self.model, "has_preprocess") or hasattr(self.model, "enable_update_additional_information"):
Contributor:

Why check has_preprocess here when enable_update_additional_information is the intended gate? Models with has_preprocess=True but no enable_update_additional_information would hit this path unnecessarily.

Contributor Author:

In the original logic, update_additional_information can be called when has_preprocess=True. I don't want to change that behavior; I only want the code2wav stage to also be able to update additional_information.

Contributor:

Fair enough, makes sense.

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
… of https://github.com/LJH-LBJ/vllm-omni into Supports-configurable-chunk_size-and-left_context_size

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@amy-why-3459 (Contributor):

Please check if the changes to TTS meet expectations. @Sy0307 @linyueqian

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@linyueqian (Contributor):

Checked the TTS-side changes — overall LGTM. The move from magic input_ids[0] header to additional_information is a clean improvement and aligns TTS with Omni's approach.

Two minor things to flag:

  1. Missing max(0, ...) guard in Omni stage processor (qwen3_omni.py:261):

    left_context_size = min(length - context_length, left_context_size_config)

    Could go negative if length < context_length. TTS processor already has max(0, ...) — Omni should match.

  2. Index safety in TTS code2wav forward — the left_context_size list is built from runtime_additional_information but there's no guarantee it matches request_ids_list length. If fewer entries exist, some requests silently get left_context_size=0. Might want an explicit length check or warning.

Other than that, the enable_update_additional_information flag addition and gpu_model_runner gating look correct and necessary for code2wav (since has_preprocess=False).
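
Item 2 could be handled with an explicit alignment helper. This is only a sketch; the names `request_ids_list` and `left_context_sizes` mirror the discussion above and are assumptions, not the actual signatures:

```python
import logging

logger = logging.getLogger(__name__)

def align_left_context_sizes(request_ids_list, left_context_sizes, default=0):
    """Pad or truncate so every request gets an entry, warning on any mismatch."""
    if len(left_context_sizes) != len(request_ids_list):
        logger.warning(
            "left_context_sizes length %d does not match request count %d;"
            " padding missing entries with %d",
            len(left_context_sizes), len(request_ids_list), default,
        )
        left_context_sizes = (
            left_context_sizes[: len(request_ids_list)]
            + [default] * (len(request_ids_list) - len(left_context_sizes))
        )
    return left_context_sizes
```

This makes the silent `left_context_size=0` fallback explicit and logged, rather than implicit.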

LJH-LBJ and others added 2 commits February 28, 2026 12:05
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@LJH-LBJ (Contributor Author) commented Feb 28, 2026:

Replying to @linyueqian's two flagged items above:

  1. Done

  2. I check whether the lengths of left_context_size and seq_token_counts in chunked_decode_streaming are equal; if not, a warning is emitted.

@Sy0307 (Contributor) commented Feb 28, 2026:

LGTM. Please add test results for Qwen3-tts.

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@LJH-LBJ (Contributor Author) commented Mar 2, 2026:

LGTM. Please add test results for Qwen3-tts.

Done

Successfully merging this pull request may close these issues.

[RFC]: Supports configurable chunk_size and left_context_size
