Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 368d815bcf
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
vllm_omni/metrics/stats.py

```diff
 and (multimodal_output := output_to_yield.multimodal_output.get("audio")) is not None
 ):
-nframes = int(multimodal_output[-1].shape[0])
+nframes = int(multimodal_output.shape[0])
```
Guard list audio payloads before reading tensor shape
When audio models return more than one waveform per prompt, DiffusionEngine.step stores multimodal_output["audio"] as a Python list (audio_payload = outputs when len(outputs) > 1), but this code now unconditionally calls .shape[0] on that value. In that scenario, record_audio_generated_frames raises AttributeError: 'list' object has no attribute 'shape', aborting the request path instead of just updating metrics.
Thanks for your changes—this is a more robust approach. What motivated the change to
Pull request overview
Fixes regressions introduced by PR #891 affecting Qwen3-TTS audio metrics computation, and updates Qwen3-TTS tokenizer padding/decoding behavior to align with upstream tokenizer changes.
Changes:
- Update Qwen3-TTS tokenizer decode paths to support `-1` padding (via clamping before decode).
- Change Qwen3-TTS tokenizer wrapper padding from `0` to `-1` for padded batches.
- Make audio frame metrics aggregation handle non-list audio payloads by summing frames across chunks.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py | Adjusts decode to compute lengths before clamping padded codes. |
| vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py | Same as above for the 12Hz tokenizer variant. |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py | Changes batch padding value for audio codes from 0 to -1. |
| vllm_omni/metrics/stats.py | Updates audio-generated-frames metric computation to sum across chunked outputs. |
Comments suppressed due to low confidence (2)
vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py:1000
- Same issue as V1: `audio_lengths` is computed before clamping (to ignore `-1` padding) but then overwritten after `torch.clamp()` using `(audio_codes[..., 0] > 0)`, which can miscompute lengths once pads have been clamped to 0 (and if 0 is a valid code). Use the pre-clamp length for trimming and drop or correct the post-clamp recomputation.

```python
audio_lengths = (audio_codes[..., 0] > -1).sum(1) * self.decode_upsample_rate
audio_codes = torch.clamp(audio_codes, min=0)
audio_values = self.decoder.chunked_decode(audio_codes.transpose(1, 2)).squeeze(1)
audio_lengths = (audio_codes[..., 0] > 0).sum(1) * self.decode_upsample_rate
audio_values = [a[:length] for a, length in zip(audio_values, audio_lengths)]
```
vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py:1519
- In `decode()`, `audio_lengths` is computed using the pre-clamp codes (to account for `-1` padding), but then it is overwritten using `(audio_codes > 0)` after `torch.clamp()`. This defeats the purpose of using `-1` padding (pads become 0) and can truncate outputs incorrectly when code 0 is valid. Keep and use the pre-clamp length (e.g., based on `audio_codes > -1` / `>= 0`) and remove or adjust the second `audio_lengths` assignment.

```python
audio_lengths = (audio_codes > -1).sum(1) * self.decode_upsample_rate
audio_codes = torch.clamp(audio_codes, min=0)
audio_values = self.decoder(code=audio_codes, reference_mel=ref_mels, conditioning=xvectors)
audio_lengths = (audio_codes > 0).sum(1) * self.decode_upsample_rate
audio_values = [a[:length] for a, length in zip(audio_values, audio_lengths)]
```
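To see why the post-clamp recomputation miscounts, here is a small self-contained example (purely illustrative: the code values are made up and the upsample factor is taken as 1):

```python
import torch

# Two sequences padded to length 4 with -1; code 0 is a valid codec index.
audio_codes = torch.tensor([[5, 0, 3, -1],
                            [7, 2, -1, -1]])

pre_clamp_lengths = (audio_codes > -1).sum(1)   # counts real codes, incl. 0
clamped = torch.clamp(audio_codes, min=0)       # pads -1 -> 0
post_clamp_lengths = (clamped > 0).sum(1)       # drops the valid 0 in row 0

print(pre_clamp_lengths.tolist())   # [3, 2]  (correct)
print(post_clamp_lengths.tolist())  # [2, 2]  (row 0 truncated by one frame)
```

The valid code `0` in the first row is indistinguishable from a clamped pad, which is exactly the truncation the comments above warn about.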
vllm_omni/metrics/stats.py

```python
nframes = sum(
    int(t.shape[0])
    for t in (multimodal_output if isinstance(multimodal_output, list) else [multimodal_output])
)
```
`record_audio_generated_frames()` still isn't robust to the audio payload types that triggered the original failure. If `multimodal_output` is an empty tuple / tuple of chunks, the current logic wraps it as a single element and then accesses `t.shape`, which will raise. Also, using `t.shape[0]` undercounts for common audio shapes like `[1, N]` (e.g., Qwen3-Omni code2wav returns `reshape(1, -1)`), so the metric becomes 1 instead of N. Consider handling `(list, tuple)` as sequences, gracefully treating empty sequences as 0 frames, and counting frames via `shape[-1]` (or `numel()` for 1D) to match how audio chunks are concatenated along the last dimension elsewhere.
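A minimal sketch of the counting the reviewer describes (the helper name and handling choices here are illustrative, not the project's actual implementation):

```python
import torch

def frames_in_payload(payload) -> int:
    # Treat lists/tuples as sequences of chunk tensors; an empty
    # sequence contributes 0 frames instead of raising.
    chunks = payload if isinstance(payload, (list, tuple)) else [payload]
    total = 0
    for t in chunks:
        # Audio chunks are concatenated along the last dim elsewhere,
        # so count shape[-1]; fall back to numel() for 0-D/1-D tensors.
        total += int(t.numel()) if t.dim() <= 1 else int(t.shape[-1])
    return total

print(frames_in_payload(torch.zeros(1, 2400)))                 # 2400, not 1
print(frames_in_payload(()))                                   # 0
print(frames_in_payload([torch.zeros(100), torch.zeros(50)]))  # 150
```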
vllm_omni/metrics/stats.py

```python
nframes = sum(
    int(t.shape[0])
    for t in (multimodal_output if isinstance(multimodal_output, list) else [multimodal_output])
)
```
There are existing unit tests for OrchestratorAggregator (see tests/metrics/test_stats.py), but the updated record_audio_generated_frames() behavior isn’t covered. Adding tests for at least these cases would prevent regressions: audio as a single tensor, audio as a list/tuple of chunk tensors, and empty list/tuple (should record 0 without raising).
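The suggested cases could look roughly like this (a sketch only: the assertions run against a stand-in counting function, since the exact `OrchestratorAggregator.record_audio_generated_frames()` API isn't shown here):

```python
import torch

def count_frames(payload) -> int:
    # Stand-in for the counting inside record_audio_generated_frames();
    # names and behavior here are illustrative assumptions.
    chunks = payload if isinstance(payload, (list, tuple)) else [payload]
    return sum(int(t.shape[-1]) for t in chunks)

def test_single_tensor():
    assert count_frames(torch.zeros(1, 480)) == 480

def test_chunk_sequence():
    assert count_frames([torch.zeros(1, 480), torch.zeros(1, 240)]) == 720

def test_empty_sequence_records_zero():
    assert count_frames([]) == 0
    assert count_frames(()) == 0

test_single_tensor()
test_chunk_sequence()
test_empty_sequence_records_zero()
```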
Gaohan123
left a comment
Thanks for the nice catch. Could you please post some test results?
Hi, do you mean the stats? Like this one:
linyueqian
left a comment
Thanks for syncing with the upstream Qwen3-TTS tokenizer fix and addressing the metrics crash!
Metrics fix (stats.py): This change is no longer needed — the crash was already fixed on main via #1206 with a more defensive implementation (try/except guard, len check, scalar tensor handling). Since this branch was forked before that merge, the fix here is outdated. I'd suggest dropping the stats.py changes and rebasing on latest main.
Tokenizer padding fix: The padding_value=-1 and torch.clamp changes are correct and align with upstream, but there's a critical issue — see inline comments.
vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py
vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py
Signed-off-by: pablo <juanz9312@gmail.com>
Force-pushed from 170ba43 to 2cb5d11
Thanks @linyueqian, it's aligned now to upstream.

@Gaohan123 @linyueqian I closed this by mistake. Could we merge this?

LGTM @hsliuustc0106

@vllm-omni-reviewer
🤖 VLLM-Omni PR Review

Code Review: [Bugfix][Qwen3TTS]

1. Overview
This PR fixes a bug introduced in PR #891 that caused an `AttributeError` while computing metrics. The changes align with an upstream update in the Qwen3-TTS repository.

Overall Assessment: Positive - The fix addresses a real bug and follows the upstream implementation pattern.

2. Code Quality
Strengths

Potential Issues
Logic correctness of padding value change: Order of operations:

```python
# Correct order:
audio_lengths = (audio_codes[..., 0] > -1).sum(1) * self.decode_upsample_rate  # Before clamp
audio_codes = torch.clamp(audio_codes, min=0)  # Then clamp
audio_values = self.decoder.chunked_decode(...)  # Then decode
```

3. Architecture & Design
Integration

Design Consideration
The use of

4. Security & Safety
No significant security concerns. The changes are:

5. Testing & Documentation
Concerns
Missing Test Plan and Results:

Recommended test verification:

```shell
# Example test command that should be documented
python -c "
from vllm_omni import AsyncLLMEngine
# ... test code that reproduces the original error scenario
"
```

6. Specific Suggestions
Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
PR #891 introduced an issue while computing the metrics when sending a TTS request with Qwen3TTS:
Also, there's an update in the Qwen3 Speech tokenizer done in the main repo: QwenLM/Qwen3-TTS@6cafe55
Test Plan
Test Result