[Feat] Support async_chunk additional_information delivery to V2 model runner#2607

Draft
Sy0307 wants to merge 1 commit into vllm-project:dev/migrate-MR-v2 from Sy0307:fix/v2-improvements-on-2522

Conversation

Contributor

@Sy0307 Sy0307 commented Apr 8, 2026

Purpose

Fix async_chunk mode producing garbage/short audio in V2 model runner.

Root cause: additional_information (containing thinker_decode_embeddings and thinker_output_token_ids) was never propagated from the scheduler's CachedRequestData to the runner's intermediate_buffer during decode steps. The chunk_transfer_adapter correctly polled data from SharedMemoryConnector and attached it to scheduled_cached_reqs.additional_information, but GPUModelRunner.update_requests() does not handle this field, so the data was silently dropped.

Additionally fixes three correctness issues found during review:

  1. _handle_async_chunk_updates passed raw AdditionalInformationPayload objects to intermediate_buffer.update(), which expects a dict, causing an AttributeError when the payload is not pre-resolved
  2. Inline deserialization in the scheduler only preserved list_data, silently dropping tensor_data and scalar_data entries
  3. cleanup() (sender+receiver) replaced cleanup_receiver() in the failed-KV-load path, risking race conditions with background save threads
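For issues 1 and 2, the resolution step can be sketched like this (the payload shape and helper name are assumed for illustration; the real AdditionalInformationPayload may differ):

```python
# Illustrative sketch: resolve a payload into the plain dict that
# intermediate_buffer.update() expects, keeping list, tensor, and scalar
# entries alike (issue 2 was that only list_data survived).
# The dataclass below is an assumed shape, not the actual vLLM type.

from dataclasses import dataclass, field


@dataclass
class AdditionalInformationPayload:
    list_data: dict = field(default_factory=dict)
    tensor_data: dict = field(default_factory=dict)
    scalar_data: dict = field(default_factory=dict)


def resolve_additional_information(payload) -> dict:
    """Flatten a payload into a plain dict; pass dicts through unchanged."""
    if isinstance(payload, dict):  # already resolved upstream
        return payload
    merged: dict = {}
    merged.update(payload.list_data)
    merged.update(payload.tensor_data)
    merged.update(payload.scalar_data)
    return merged
```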

Test Plan

  • Qwen3-Omni-30B async_chunk end-to-end (use_audio query, 2xH20)
  • ASR verification: Whisper transcription matches expected text output
  • Audio duration: 22.04s (previously 4.57s with the bug)
  • Sync mode regression check (non-async_chunk path unchanged)

Test Result

Before fix: Talker sees thinker_output_token_ids=[], thinker_decode_embeddings=None -> early EOS after ~33 decode steps -> 4.57s noise audio

After fix: Talker correctly receives incremental thinker data -> 329+ decode steps -> 22.04s audio, ASR output:

"The audio contains a man reciting the nursery rhyme Mary had a little lamb. He begins by saying the first words I spoke in the original phonograph before reciting the rhyme. Mary had a little lamb. Its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go."

@Sy0307 Sy0307 changed the title [Bugfix] Fix async_chunk additional_information delivery to V2 model runner [Feat] Support async_chunk additional_information delivery to V2 model runner Apr 8, 2026

- Add update_requests() to OmniGPUModelRunner to propagate
  additional_information from scheduler to intermediate_buffer
- Use _resolve_additional_information for AdditionalInformationPayload
  deserialization in both AR and generation runners
- Revert cleanup() to cleanup_receiver() for concurrent safety
- Fix _safe_get_rope control flow (remove exception-as-goto pattern)
- Add Talker M-RoPE fallback returning 3D sequential positions

Signed-off-by: Sy03 <1370724210@qq.com>
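The M-RoPE fallback mentioned in the last commit bullet can be sketched as follows (function name and exact return shape are assumed; the real implementation likely returns tensors):

```python
# Illustrative sketch: when multimodal rope deltas are unavailable for the
# Talker, fall back to plain sequential positions replicated across the three
# M-RoPE axes (temporal, height, width). Returns a 3 x length structure with
# identical rows, which degrades M-RoPE to ordinary 1D rotary positions.
# The function name is assumed for illustration.

def mrope_sequential_fallback(start: int, length: int) -> list:
    """Return 3 x length sequential positions, identical on every axis."""
    row = list(range(start, start + length))
    return [row, row.copy(), row.copy()]
```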
@Sy0307 Sy0307 force-pushed the fix/v2-improvements-on-2522 branch from 4e80cc4 to a6ef196 Compare April 8, 2026 19:33
