When recognizing Chinese, if the input audio is silent or consists only of background noise, the model often outputs the following unrelated fixed text
#2645
Environment
• Model: whisper-large-v3
Bug Description
When recognizing Chinese, if the input audio is silent or consists only of background noise, the model often outputs the following unrelated fixed text:
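“请不吝点赞 订阅 转发 打赏支持明镜与点点栏目” (roughly: "Please don't hesitate to like, subscribe, share, and donate to support the Mingjing and Diandian programs")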
This output is completely unrelated to the input audio.
Expected Behavior
For silent audio or background noise input, the model should:
• Return an empty string ("") or a special token (e.g. [NO_SPEECH]),
• Or explicitly indicate that no human speech was detected.
It should not output irrelevant text such as advertisements or unrelated phrases.
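Until the model behaves this way, a caller can approximate the expected behavior by post-filtering the result. The sketch below assumes the openai-whisper Python package, whose `transcribe()` result carries a `segments` list with a `no_speech_prob` field per segment. The helper name `drop_non_speech` and the 0.6 threshold are illustrative assumptions, not part of the library. Whether the hallucinated phrase actually comes with a high `no_speech_prob` is not established in this report, so treat this as something to try rather than a confirmed fix.

```python
def drop_non_speech(segments, no_speech_threshold=0.6):
    """Concatenate the text of segments that look like real speech.

    `segments` is a list of dicts shaped like result["segments"] from
    openai-whisper's transcribe(). The threshold is an assumed value.
    Returns "" when every segment is judged to be non-speech.
    """
    kept = []
    for seg in segments:
        # Drop the segment purely on its no-speech probability. This is
        # stricter than Whisper's built-in check, which also requires a low
        # average log-probability before it skips a segment.
        if seg["no_speech_prob"] > no_speech_threshold:
            continue
        kept.append(seg["text"])
    return "".join(kept).strip()
```

With openai-whisper installed, this would be called as `drop_non_speech(model.transcribe(path, language="zh")["segments"])`. Being stricter than the default check, it can also discard quiet but genuine speech.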
To Reproduce
1. Load whisper-large-v3
2. Provide an input audio clip that is either silent or only contains background noise
3. Observe that the model outputs the unrelated fixed phrase
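The steps above can be sketched as a small script. It assumes the openai-whisper Python package (where the model is named `"large-v3"`) and an installed ffmpeg. The report does not say which toolkit was used, so the package choice, the file names, the clip length, and the noise amplitude are all assumptions. The WAV helpers use only the standard library.

```python
import random
import struct
import wave

SAMPLE_RATE = 16000  # Whisper works on 16 kHz audio internally


def write_wav(path, samples):
    """Write 16-bit mono PCM samples (ints in [-32768, 32767]) to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def make_silence(seconds):
    """Digital silence: every sample is zero."""
    return [0] * int(seconds * SAMPLE_RATE)


def make_noise(seconds, amplitude=300, seed=0):
    """Low-level white noise as a stand-in for background noise (assumed level)."""
    rng = random.Random(seed)
    count = int(seconds * SAMPLE_RATE)
    return [rng.randint(-amplitude, amplitude) for _ in range(count)]


def transcribe(path, model_name="large-v3"):
    """Transcribe as Chinese with openai-whisper (downloads the model on first use)."""
    import whisper  # imported lazily so the WAV helpers work without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, language="zh")["text"]


# Example usage (requires openai-whisper and ffmpeg):
#   write_wav("silence.wav", make_silence(10))
#   write_wav("noise.wav", make_noise(10))
#   print(repr(transcribe("silence.wav")), repr(transcribe("noise.wav")))
```

Synthetic silence and white noise may not trigger the phrase as reliably as real recordings with background music, so a real non-speech clip is worth testing as well.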
Possible Cause
• The training dataset may contain many samples of background music paired with subtitles that do not match the actual audio content.
• The training data may not have been filtered by volume or VAD (Voice Activity Detection), so non-speech segments ended up paired with subtitle text.
• Because such noisy samples likely appeared often, the model may have learned to emit these fixed, high-frequency phrases whenever the input contains no speech.
Suggestions for Improvement
• Apply volume-based filtering or VAD to clean training data
• Assign no_speech labels for non-speech segments instead of pairing them with subtitle text
• Explicitly label audio source types (human speech / background music / environmental noise) in the training data so the model can distinguish them properly
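As a rough illustration of the first two suggestions, the sketch below applies a volume-based check to a training clip and replaces its subtitle with an empty transcript when the clip is mostly quiet. It is not Whisper's actual data pipeline. The function names, the frame length, and both thresholds are assumptions chosen for illustration. A pure energy check cannot tell music from speech, so a real VAD would still be needed for clips with background music.

```python
import math


def frame_rms(samples, frame_len):
    """Yield the RMS energy of consecutive non-overlapping frames."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield math.sqrt(sum(x * x for x in frame) / frame_len)


def label_clip(samples, transcript, frame_len=400,
               rms_threshold=0.01, min_active_ratio=0.1):
    """Return the transcript, or "" as a no-speech label for quiet clips.

    `samples` are floats in [-1.0, 1.0]. All thresholds are assumed values.
    """
    energies = list(frame_rms(samples, frame_len))
    if not energies:
        return ""  # clip shorter than one frame: treat as no speech
    active = sum(1 for energy in energies if energy >= rms_threshold)
    if active / len(energies) < min_active_ratio:
        return ""  # too few loud frames: drop the mismatched subtitle
    return transcript
```

At a 16 kHz sample rate, `frame_len=400` corresponds to 25 ms frames.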