
[Bugfix][Qwen3TTS]#1289

Merged
hsliuustc0106 merged 7 commits into vllm-project:main from JuanPZuluaga:qwen3tts.fix-bug
Feb 26, 2026

Conversation

@JuanPZuluaga
Contributor

@JuanPZuluaga JuanPZuluaga commented Feb 9, 2026


Purpose

PR #891 introduced an issue in metrics computation when sending a TTS request with Qwen3TTS:

(APIServer pid=987948) INFO 02-09 12:37:54 [serving_speech.py:236] TTS speech request speech-bf3b5480759cbfef: text='Hello, how are you?', task_type=CustomVoice
(APIServer pid=987948) INFO 02-09 12:37:54 [async_omni.py:315] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
(Worker pid=989530) [Stage-0] INFO 02-09 12:37:54 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=989530) Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310] Speech generation failed: tuple index out of range
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310] Traceback (most recent call last):
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]   File "/home/pablo/agigo/claude_code/agigo_tts_offline/dialog-platform/tts/agigo-tts-offline/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 253, in create_speech
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]     async for res in generator:
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]   File "/home/pablo/agigo/claude_code/agigo_tts_offline/dialog-platform/tts/agigo-tts-offline/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 331, in generate
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]     async for output in self._process_sequential_results(
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]   File "/home/pablo/agigo/claude_code/agigo_tts_offline/dialog-platform/tts/agigo-tts-offline/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 431, in _process_sequential_results
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]     metrics.record_audio_generated_frames(
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]   File "/home/pablo/agigo/claude_code/agigo_tts_offline/dialog-platform/tts/agigo-tts-offline/vllm-omni/vllm_omni/metrics/stats.py", line 229, in record_audio_generated_frames
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]     nframes = int(multimodal_output[-1].shape[0])
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310]                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(APIServer pid=987948) ERROR 02-09 12:37:55 [serving_speech.py:310] IndexError: tuple index out of range
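The failure mode can be reproduced in isolation with a small stand-in for a tensor (a hypothetical sketch; the real payload is a torch.Tensor): when the audio payload is a single 1-D waveform, `multimodal_output[-1]` selects one scalar sample, whose shape is the empty tuple, so `shape[0]` raises.

```python
# Hypothetical minimal reproduction of the crash above. FakeTensor stands in
# for torch.Tensor; only .shape and 1-D indexing are modeled.
class FakeTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)

    def __getitem__(self, idx):
        # Indexing a 1-D tensor drops the leading dimension (a scalar has
        # shape == ()), mirroring torch semantics.
        return FakeTensor(self.shape[1:])

audio = FakeTensor((86400,))  # one waveform of 86,400 frames
error = None
try:
    nframes = int(audio[-1].shape[0])  # audio[-1] is a scalar: shape == ()
except IndexError as exc:
    error = str(exc)  # "tuple index out of range"
```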

Additionally, the Qwen3 Speech tokenizer was updated in the upstream repo: QwenLM/Qwen3-TTS@6cafe55

Test Plan

Test Result




@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 368d815bcf


    and (multimodal_output := output_to_yield.multimodal_output.get("audio")) is not None
):
-    nframes = int(multimodal_output[-1].shape[0])
+    nframes = int(multimodal_output.shape[0])


P1: Guard list audio payloads before reading tensor shape

When audio models return more than one waveform per prompt, DiffusionEngine.step stores multimodal_output["audio"] as a Python list (audio_payload = outputs when len(outputs) > 1), but this code now unconditionally calls .shape[0] on that value. In that scenario, record_audio_generated_frames raises AttributeError: 'list' object has no attribute 'shape', aborting the request path instead of just updating metrics.
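A guard along the lines Codex describes might look like the following (a sketch, not the code that landed; `count_frames` and `FakeTensor` are illustrative names):

```python
class FakeTensor:
    """Illustrative stand-in for torch.Tensor, modeling only .shape."""
    def __init__(self, shape):
        self.shape = tuple(shape)

def count_frames(multimodal_output):
    """Sum frame counts whether the payload is one tensor or a list of them."""
    if multimodal_output is None:
        return 0
    chunks = (multimodal_output
              if isinstance(multimodal_output, (list, tuple))
              else [multimodal_output])
    return sum(int(chunk.shape[0]) for chunk in chunks)

# Single-waveform and multi-waveform payloads both work:
count_frames(FakeTensor((86400,)))                      # 86400
count_frames([FakeTensor((100,)), FakeTensor((200,))])  # 300
```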


@LJH-LBJ
Contributor

LJH-LBJ commented Feb 9, 2026

Thanks for your changes—this is a more robust approach. What motivated the change to audio_lengths?

Contributor

Copilot AI left a comment


Pull request overview

Fixes regressions introduced by PR #891 affecting Qwen3-TTS audio metrics computation, and updates Qwen3-TTS tokenizer padding/decoding behavior to align with upstream tokenizer changes.

Changes:

  • Update Qwen3-TTS tokenizer decode paths to support -1 padding (via clamping before decode).
  • Change Qwen3-TTS tokenizer wrapper padding from 0 to -1 for padded batches.
  • Make audio frame metrics aggregation handle non-list audio payloads by summing frames across chunks.
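The padding change in the list above can be illustrated with a plain-Python stand-in for torch's pad_sequence (`pad_batch` is a hypothetical helper; the real code uses torch.nn.utils.rnn.pad_sequence):

```python
def pad_batch(seqs, pad_value=-1):
    """Right-pad every sequence to the longest length, batch-first."""
    width = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (width - len(s)) for s in seqs]

batch = pad_batch([[5, 0, 9], [2150]])
# batch == [[5, 0, 9], [2150, -1, -1]]: a later (code > -1) test recovers
# the true lengths 3 and 1, which a pad value of 0 could not, since 0 is a
# valid audio code.
```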

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py: Adjusts decode to compute lengths before clamping padded codes.
  • vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py: Same as above for the 12Hz tokenizer variant.
  • vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py: Changes batch padding value for audio codes from 0 to -1.
  • vllm_omni/metrics/stats.py: Updates the audio-generated-frames metric computation to sum across chunked outputs.
Comments suppressed due to low confidence (2)

vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py:1000

  • Same issue as V1: audio_lengths is computed before clamping (to ignore -1 padding) but then overwritten after torch.clamp() using (audio_codes[..., 0] > 0), which can miscompute lengths once pads have been clamped to 0 (and if 0 is a valid code). Use the pre-clamp length for trimming and drop or correct the post-clamp recomputation.
        audio_lengths = (audio_codes[..., 0] > -1).sum(1) * self.decode_upsample_rate

        audio_codes = torch.clamp(audio_codes, min=0)
        audio_values = self.decoder.chunked_decode(audio_codes.transpose(1, 2)).squeeze(1)

        audio_lengths = (audio_codes[..., 0] > 0).sum(1) * self.decode_upsample_rate
        audio_values = [a[:length] for a, length in zip(audio_values, audio_lengths)]

vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py:1519

  • In decode(), audio_lengths is computed using the pre-clamp codes (to account for -1 padding), but then it is overwritten using (audio_codes > 0) after torch.clamp(). This defeats the purpose of using -1 padding (pads become 0) and can truncate outputs incorrectly when code 0 is valid. Keep and use the pre-clamp length (e.g., based on audio_codes > -1 / >= 0) and remove or adjust the second audio_lengths assignment.
        audio_lengths = (audio_codes > -1).sum(1) * self.decode_upsample_rate

        audio_codes = torch.clamp(audio_codes, min=0)
        audio_values = self.decoder(code=audio_codes, reference_mel=ref_mels, conditioning=xvectors)

        audio_lengths = (audio_codes > 0).sum(1) * self.decode_upsample_rate
        audio_values = [a[:length] for a, length in zip(audio_values, audio_lengths)]
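The miscount both comments describe is easy to demonstrate with plain Python in place of torch:

```python
# Why the length must be taken before clamping: once -1 pads are clamped to
# 0, a "> 0" test also discards genuine code-0 tokens.
codes = [7, 0, 3, -1, -1]                     # 3 real codes (0 is valid), 2 pads
pre_clamp_len = sum(c > -1 for c in codes)    # 3: counts every real code
clamped = [max(c, 0) for c in codes]          # pads become 0
post_clamp_len = sum(c > 0 for c in clamped)  # 2: wrongly drops the code 0
```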


Comment on lines 229 to 232
nframes = sum(
    int(t.shape[0])
    for t in (multimodal_output if isinstance(multimodal_output, list) else [multimodal_output])
)

Copilot AI Feb 9, 2026


record_audio_generated_frames() still isn’t robust to the audio payload types that triggered the original failure. If multimodal_output is an empty tuple / tuple of chunks, the current logic wraps it as a single element and then accesses t.shape, which will raise. Also, using t.shape[0] undercounts for common audio shapes like [1, N] (e.g., Qwen3-Omni code2wav returns reshape(1, -1)), so the metric becomes 1 instead of N. Consider handling (list, tuple) as sequences, gracefully treating empty sequences as 0 frames, and counting frames via shape[-1] (or numel() for 1D) to match how audio chunks are concatenated along the last dimension elsewhere.
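A version incorporating those suggestions might look like this (a hedged sketch under the comment's assumptions; `robust_frame_count` and `FakeTensor` are illustrative names, not the merged code):

```python
class FakeTensor:
    """Illustrative stand-in for torch.Tensor, modeling only .shape."""
    def __init__(self, shape):
        self.shape = tuple(shape)

def robust_frame_count(payload):
    """Accept a tensor, list, or tuple; treat empty sequences as 0 frames;
    count along the last dimension so a [1, N] waveform contributes N."""
    if payload is None:
        return 0
    chunks = payload if isinstance(payload, (list, tuple)) else [payload]
    total = 0
    for chunk in chunks:
        shape = getattr(chunk, "shape", None)
        if not shape:  # scalar shape () or no shape at all: nothing to count
            continue
        total += int(shape[-1])
    return total
```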

Comment on lines 229 to 232
nframes = sum(
    int(t.shape[0])
    for t in (multimodal_output if isinstance(multimodal_output, list) else [multimodal_output])
)

Copilot AI Feb 9, 2026


There are existing unit tests for OrchestratorAggregator (see tests/metrics/test_stats.py), but the updated record_audio_generated_frames() behavior isn’t covered. Adding tests for at least these cases would prevent regressions: audio as a single tensor, audio as a list/tuple of chunk tensors, and empty list/tuple (should record 0 without raising).

Collaborator

@Gaohan123 Gaohan123 left a comment


Thanks for the nice catch. Could you please post some test results?

@Gaohan123 Gaohan123 added this to the v0.16.0 milestone Feb 10, 2026
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Feb 10, 2026

Thanks for the nice catch. Could you please post some test results?

Hi, do you mean the stats? Like this one:

(APIServer pid=3129320) INFO 02-10 06:29:15 [serving_speech.py:236] TTS speech request speech-806e9155062abcf3: text='Hi Gao Han, vllm-omni is a very cool project!', task_type=CustomVoice
(APIServer pid=3129320) INFO 02-10 06:29:15 [async_omni.py:315] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
(Worker pid=3130826) [Stage-0] INFO 02-10 06:29:15 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=3130826) Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] [Overall Summary]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] +-----------------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | Field                       |     Value |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] +-----------------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_requests                |         1 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_wall_time_ms            | 2,596.641 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_total_tokens            |        25 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_avg_time_per_request_ms | 2,596.641 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_avg_tokens_per_s        |     9.628 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] | e2e_stage_0_wall_time_ms    | 2,596.608 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:429] +-----------------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] [RequestE2EStats [request_id=speech-806e9155062abcf3]]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] +------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] | Field            |     Value |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] +------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] | e2e_total_ms     | 2,596.607 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] | e2e_total_tokens |        25 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:455] +------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] [StageRequestStats [request_id=speech-806e9155062abcf3]]
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] +------------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | Field                  |         0 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] +------------------------+-----------+
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | audio_generated_frames |    86,400 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | batch_id               |         1 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | batch_size             |         1 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | num_tokens_in          |        25 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] | stage_gen_time_ms      | 2,591.946 |
(APIServer pid=3129320) INFO 02-10 06:29:18 [stats.py:508] +------------------------+-----------+

Contributor

@linyueqian linyueqian left a comment


Thanks for syncing with the upstream Qwen3-TTS tokenizer fix and addressing the metrics crash!

Metrics fix (stats.py): This change is no longer needed — the crash was already fixed on main via #1206 with a more defensive implementation (try/except guard, len check, scalar tensor handling). Since this branch was forked before that merge, the fix here is outdated. I'd suggest dropping the stats.py changes and rebasing on latest main.

Tokenizer padding fix: The padding_value=-1 and torch.clamp changes are correct and align with upstream, but there's a critical issue — see inline comments.

Signed-off-by: pablo <juanz9312@gmail.com>
@JuanPZuluaga
Contributor Author

Thanks @linyueqian, it's now aligned with upstream.

@JuanPZuluaga JuanPZuluaga deleted the qwen3tts.fix-bug branch February 21, 2026 10:57
@JuanPZuluaga JuanPZuluaga restored the qwen3tts.fix-bug branch February 23, 2026 14:47
@JuanPZuluaga JuanPZuluaga reopened this Feb 23, 2026
@JuanPZuluaga
Contributor Author

@Gaohan123 @linyueqian I closed this by mistake. Could we merge this?

@linyueqian
Contributor

LGTM @hsliuustc0106

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: [Bugfix][Qwen3TTS]

1. Overview

This PR fixes a bug introduced in PR #891 that caused an IndexError: tuple index out of range when computing metrics for TTS requests with Qwen3TTS. The fix involves:

  1. Changing the padding value from 0 to -1 in the audio codes padding sequence
  2. Computing audio lengths before decoding using the -1 sentinel value
  3. Adding torch.clamp(audio_codes, min=0) to ensure negative padding values don't propagate to the decoder

The changes align with an upstream update in the Qwen3-TTS repository.

Overall Assessment: Positive - The fix addresses a real bug and follows the upstream implementation pattern.

2. Code Quality

Strengths

  • Consistent application: The fix is applied consistently across both tokenizer versions (12Hz and 25Hz)
  • Minimal changes: The diff is focused and targeted on the specific issue
  • Follows upstream: Aligns with the referenced Qwen3-TTS commit

Potential Issues

Logic correctness of padding value change:
The change from padding_value=0 to padding_value=-1 is correct because 0 could be a valid audio code token. Using -1 as a sentinel value is a common pattern for masking/padding.

Order of operations:
The reordering to compute audio_lengths before decoding is important because after torch.clamp, the -1 padding values become 0, which would make the length calculation incorrect if done after clamping.

# Correct order:
audio_lengths = (audio_codes[..., 0] > -1).sum(1) * self.decode_upsample_rate  # Before clamp
audio_codes = torch.clamp(audio_codes, min=0)  # Then clamp
audio_values = self.decoder.chunked_decode(...)  # Then decode

3. Architecture & Design

Integration

  • The changes integrate well with the existing codebase structure
  • Both tokenizer variants (12Hz and 25Hz) receive the same fix, maintaining consistency

Design Consideration

The use of -1 as a padding sentinel and subsequent clamping is a reasonable design pattern for handling variable-length sequences in tensor operations.

4. Security & Safety

No significant security concerns. The changes are:

  • Memory-safe (no new allocations that could cause issues)
  • Numerically safe (clamping prevents negative indices/values)

5. Testing & Documentation

Concerns

Missing Test Plan and Results:
The PR description has empty "Test Plan" and "Test Result" sections. Given the error traceback provided, the PR should include:

  1. Verification that the specific error case now works
  2. Confirmation that audio output is correctly generated
  3. Metrics are properly recorded

Recommended test verification:

# Example test command that should be documented
python -c "
from vllm_omni import AsyncLLMEngine
# ... test code that reproduces the original error scenario
"

6. Specific Suggestions

qwen3_tts_tokenizer.py:331

audio_codes_padded = pad_sequence(audio_codes_list, batch_first=True, padding_value=-1).to(self.device)

Suggestion: Consider adding a comment explaining why -1 is used as the padding value:

# Use -1 as padding value since 0 can be a valid audio code token
audio_codes_padded = pad_sequence(audio_codes_list, batch_first=True, padding_value=-1).to(self.device)

tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py:994-997

The logic is correct, but consider adding a brief comment:

# Compute length before clamping since -1 is used as padding marker
audio_lengths = (audio_codes[..., 0] > -1).sum(1) * self.decode_upsample_rate
# Clamp to ensure padding values don't cause issues in decoder
audio_codes = torch.clamp(audio_codes, min=0)

tokenizer_25hz/modeling_qwen3_tts_tokenizer_v1.py:1513-1516

Same suggestion as above for consistency.

7. Approval Status

LGTM with suggestions

The fix is technically correct and addresses the root cause of the bug. The changes follow the upstream implementation and are applied consistently across both tokenizer variants.

Before merging, please:

  1. Fill in the Test Plan and Test Results sections in the PR description
  2. Verify the fix resolves the original error with a concrete test case
  3. Consider adding the suggested comments for code clarity (optional but recommended)

The core logic changes are sound and ready to merge once testing is documented.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Feb 25, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit 66457c3 into vllm-project:main Feb 26, 2026
6 of 7 checks passed
@JuanPZuluaga JuanPZuluaga deleted the qwen3tts.fix-bug branch February 26, 2026 13:54
xuechendi pushed a commit to xuechendi/vllm-omni that referenced this pull request Feb 26, 2026
Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

Labels

ready label to trigger buildkite CI
