[Feat] Support async_chunk additional_information delivery to V2 model runner#2607

Draft
Sy0307 wants to merge 1 commit into vllm-project:dev/migrate-MR-v2 from Sy0307:fix/v2-improvements-on-2522

Conversation

Contributor

@Sy0307 Sy0307 commented Apr 8, 2026

Purpose

Fix async_chunk mode producing garbage/short audio in V2 model runner.

Root cause: additional_information (containing thinker_decode_embeddings and thinker_output_token_ids) was never propagated from the scheduler's CachedRequestData to the runner's intermediate_buffer during decode steps. The chunk_transfer_adapter correctly polled data from SharedMemoryConnector and attached it to scheduled_cached_reqs.additional_information, but GPUModelRunner.update_requests() does not handle this field, so the data was silently dropped.

Additionally fixes three correctness issues found during review:

  1. _handle_async_chunk_updates passed raw AdditionalInformationPayload objects to intermediate_buffer.update(), which expects a dict, causing an AttributeError when the payload is not pre-resolved
  2. Inline deserialization in the scheduler only preserved list_data, silently dropping tensor_data and scalar_data entries
  3. cleanup() (sender+receiver) replaced cleanup_receiver() in the failed-KV-load path, risking race conditions with background save threads
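For issues 1 and 2, the resolution step can be sketched like this (the payload shape and helper name are assumed for illustration; the real AdditionalInformationPayload may differ):

```python
# Illustrative sketch: resolve a payload into the plain dict that
# intermediate_buffer.update() expects, keeping list, tensor, and scalar
# entries alike (issue 2 was that only list_data survived).
# The dataclass below is an assumed shape, not the actual vLLM type.

from dataclasses import dataclass, field


@dataclass
class AdditionalInformationPayload:
    list_data: dict = field(default_factory=dict)
    tensor_data: dict = field(default_factory=dict)
    scalar_data: dict = field(default_factory=dict)


def resolve_additional_information(payload) -> dict:
    """Flatten a payload into a plain dict; pass dicts through unchanged."""
    if isinstance(payload, dict):  # already resolved upstream
        return payload
    merged: dict = {}
    merged.update(payload.list_data)
    merged.update(payload.tensor_data)
    merged.update(payload.scalar_data)
    return merged
```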

Test Plan

  • Qwen3-Omni-30B async_chunk end-to-end (use_audio query, 2xH20)
  • ASR verification: Whisper transcription matches expected text output
  • Audio duration: 22.04s (previously 4.57s with the bug)
  • Sync mode regression check (non-async_chunk path unchanged)

Test Result

Before fix: Talker sees thinker_output_token_ids=[], thinker_decode_embeddings=None -> early EOS after ~33 decode steps -> 4.57s noise audio

After fix: Talker correctly receives incremental thinker data -> 329+ decode steps -> 22.04s audio, ASR output:

"The audio contains a man reciting the nursery rhyme Mary had a little lamb. He begins by saying the first words I spoke in the original phonograph before reciting the rhyme. Mary had a little lamb. Its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go."

@Sy0307 Sy0307 changed the title [Bugfix] Fix async_chunk additional_information delivery to V2 model runner [Feat] Support async_chunk additional_information delivery to V2 model runner Apr 8, 2026

- Add update_requests() to OmniGPUModelRunner to propagate
  additional_information from scheduler to intermediate_buffer
- Use _resolve_additional_information for AdditionalInformationPayload
  deserialization in both AR and generation runners
- Revert cleanup() to cleanup_receiver() for concurrent safety
- Fix _safe_get_rope control flow (remove exception-as-goto pattern)
- Add Talker M-RoPE fallback returning 3D sequential positions

Signed-off-by: Sy03 <1370724210@qq.com>
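The M-RoPE fallback mentioned in the last commit bullet can be sketched as follows (function name and exact return shape are assumed; the real implementation likely returns tensors):

```python
# Illustrative sketch: when multimodal rope deltas are unavailable for the
# Talker, fall back to plain sequential positions replicated across the three
# M-RoPE axes (temporal, height, width). Returns a 3 x length structure with
# identical rows, which degrades M-RoPE to ordinary 1D rotary positions.
# The function name is assumed for illustration.

def mrope_sequential_fallback(start: int, length: int) -> list:
    """Return 3 x length sequential positions, identical on every axis."""
    row = list(range(start, start + length))
    return [row, row.copy(), row.copy()]
```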
@Sy0307 Sy0307 force-pushed the fix/v2-improvements-on-2522 branch from 4e80cc4 to a6ef196 Compare April 8, 2026 19:33
