[Feat] Support async_chunk additional_information delivery to V2 model runner #2607
Draft
Sy0307 wants to merge 1 commit into vllm-project:dev/migrate-MR-v2 from
Conversation
…nner

- Add update_requests() to OmniGPUModelRunner to propagate additional_information from scheduler to intermediate_buffer
- Use _resolve_additional_information for AdditionalInformationPayload deserialization in both AR and generation runners
- Revert cleanup() to cleanup_receiver() for concurrent safety
- Fix _safe_get_rope control flow (remove exception-as-goto pattern)
- Add Talker M-RoPE fallback returning 3D sequential positions

Signed-off-by: Sy03 <1370724210@qq.com>
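The first bullet, the propagation step, could be sketched roughly as follows. This is a sketch only: the runner internals, the buffer layout, and the shape of `scheduled_cached_reqs` are assumptions for illustration, not the actual vllm-project code.

```python
from types import SimpleNamespace
from typing import Any


class OmniGPUModelRunner:
    """Sketch: forward per-request additional_information from the
    scheduler output into the runner's intermediate_buffer."""

    def __init__(self) -> None:
        # req_id -> side-channel data (e.g. thinker embeddings / token ids)
        self.intermediate_buffer: dict[str, dict[str, Any]] = {}

    def update_requests(self, scheduled_cached_reqs: Any) -> None:
        # The base GPUModelRunner.update_requests() ignores this field,
        # which is why the data was silently dropped before the fix.
        info = getattr(scheduled_cached_reqs, "additional_information", None) or {}
        for req_id, data in info.items():
            # data is assumed to already be a plain dict here; raw payload
            # objects must be resolved first (_resolve_additional_information).
            self.intermediate_buffer.setdefault(req_id, {}).update(data)


# Minimal illustration with a stand-in scheduler output:
reqs = SimpleNamespace(
    additional_information={"req-0": {"thinker_output_token_ids": [11, 42]}}
)
runner = OmniGPUModelRunner()
runner.update_requests(reqs)
```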
4e80cc4 to a6ef196
Purpose
Fix async_chunk mode producing garbage/short audio in V2 model runner.
Root cause:

additional_information (containing thinker_decode_embeddings and thinker_output_token_ids) was never propagated from the scheduler's CachedRequestData to the runner's intermediate_buffer during decode steps. The chunk_transfer_adapter correctly polled data from SharedMemoryConnector and attached it to scheduled_cached_reqs.additional_information, but GPUModelRunner.update_requests() does not handle this field, so the data was silently dropped.

Additionally fixes three correctness issues found during review:

- _handle_async_chunk_updates passed raw AdditionalInformationPayload objects to intermediate_buffer.update(), which expects a dict, causing an AttributeError when the payload is not pre-resolved
- Only list_data was resolved, silently dropping tensor_data and scalar_data entries
- cleanup() (sender + receiver) replaced cleanup_receiver() in the failed-KV-load path, risking race conditions with background save threads

Test Plan
Test Result
Before fix: Talker sees thinker_output_token_ids=[], thinker_decode_embeddings=None -> early EOS after ~33 decode steps -> 4.57s noise audio.

After fix: Talker correctly receives incremental thinker data -> 329+ decode steps -> 22.04s audio, ASR output: