Replies: 4 comments
- I've encountered this problem too. Are there any good solutions?
- I also have this problem. My audio contains only about 3 seconds of silence, yet this text occupies the first 30 seconds, replacing the subtitles that should appear there.
- Same here.
- Same here, but faster-distil-whisper-large-v3 gets it right; I use that model instead and then translate the output to Chinese.
Environment
• Model: whisper-large-v3
Bug Description
When recognizing Chinese, if the input audio is silent or consists only of background noise, the model often outputs the following unrelated fixed text:
“请不吝点赞 订阅 转发 打赏支持明镜与点点栏目” (roughly: “Please like, subscribe, share, and donate to support the Mingjing and Diandian channels”)
This output is completely unrelated to the input audio.
Expected Behavior
For silent audio or background noise input, the model should:
• Return an empty string ("") or a special token (e.g. [NO_SPEECH]),
• Or explicitly indicate that no human speech was detected.
It should not output irrelevant text such as advertisements or unrelated phrases.
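As a stopgap on the user side (not a fix for the underlying training data), the reference openai-whisper package already reports a per-segment no_speech_prob that can be used to suppress these hallucinated segments. A minimal sketch, assuming the openai-whisper API; the input.wav file name and the 0.5 cutoff are illustrative assumptions, not tuned values:

```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("input.wav", language="zh")

# Each segment carries the model's own estimate that it contains no speech;
# drop segments where that estimate is high (0.5 is an assumed cutoff).
kept = [s for s in result["segments"] if s["no_speech_prob"] < 0.5]
text = "".join(s["text"] for s in kept).strip()
print(text if text else "[NO_SPEECH]")
```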
To Reproduce
1. Load whisper-large-v3
2. Provide an input audio clip that is either silent or only contains background noise
3. Observe that the model outputs the unrelated fixed phrase
Possible Cause
• The training dataset may contain a large number of samples with background music + subtitles, where the subtitles do not match the actual audio content.
• Lack of volume-based filtering or VAD (Voice Activity Detection) during training, causing non-speech segments to be paired with subtitle text.
• Because such noisy data likely appeared frequently, the model learned to output these fixed high-frequency phrases in non-speech conditions.
Suggestions for Improvement
• Apply volume-based filtering or VAD to clean the training data (an inference-time version of this gate is sketched after this list)
• Assign no_speech labels for non-speech segments instead of pairing them with subtitle text
• Explicitly label audio source types (human speech / background music / environmental noise) in the training data so the model can distinguish them properly
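The gating suggested in the first point can also be applied at inference time, skipping the model entirely on non-speech input. A sketch using the webrtcvad package; the aggressiveness mode, the 30 ms frame size, and the [NO_SPEECH] placeholder are illustrative choices:

```python
import numpy as np
import soundfile as sf
import webrtcvad

def contains_speech(path, mode=3, frame_ms=30):
    """True if any frame of the clip is classified as speech by WebRTC VAD."""
    audio, rate = sf.read(path, dtype="int16")       # VAD needs 16-bit PCM
    if audio.ndim > 1:
        audio = audio.mean(axis=1).astype(np.int16)  # mix down to mono
    vad = webrtcvad.Vad(mode)                        # 3 = most aggressive
    frame_len = int(rate * frame_ms / 1000)
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        if vad.is_speech(audio[i:i + frame_len].tobytes(), rate):
            return True
    return False

if contains_speech("input.wav"):
    pass  # transcribe with Whisper as usual
else:
    print("[NO_SPEECH]")  # skip the model entirely for non-speech input
```

Note that webrtcvad only accepts 16-bit mono PCM at 8, 16, 32, or 48 kHz, so audio at other sample rates would need resampling first.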