Replies: 4 comments
- I've encountered this problem too. Are there any good solutions?
- I also have this problem. My audio contains only about 3 seconds of silence, yet this text occupies the first 30 seconds, replacing the subtitles that should appear there.
- Same here.
- Same here, but faster-distil-whisper-large-v3 gets it right; I use that model instead and then translate the output to Chinese.
Environment
• Model: whisper-large-v3
Bug Description
When recognizing Chinese, if the input audio is silent or consists only of background noise, the model often outputs the following unrelated fixed text:
“请不吝点赞 订阅 转发 打赏支持明镜与点点栏目” (roughly: “Please like, subscribe, share, and donate to support the Mingjing and Diandian channels”)
This output is completely unrelated to the input audio.
Expected Behavior
For silent audio or background noise input, the model should:
• Return an empty string ("") or a special token (e.g. [NO_SPEECH]),
• Or explicitly indicate that no human speech was detected.
It should not output irrelevant text such as advertisements or unrelated phrases.
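As a stopgap on the user side (not a fix for the underlying training data), the reference openai-whisper package already reports a per-segment no_speech_prob that can be used to suppress these hallucinated segments. A minimal sketch, assuming the openai-whisper API; the input.wav file name and the 0.5 cutoff are illustrative assumptions, not tuned values:

```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("input.wav", language="zh")

# Each segment carries the model's own estimate that it contains no speech;
# drop segments where that estimate is high (0.5 is an assumed cutoff).
kept = [s for s in result["segments"] if s["no_speech_prob"] < 0.5]
text = "".join(s["text"] for s in kept).strip()
print(text if text else "[NO_SPEECH]")
```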
To Reproduce
1. Load whisper-large-v3
2. Provide an input audio clip that is either silent or only contains background noise
3. Observe that the model outputs the unrelated fixed phrase
Possible Cause
• The training dataset may contain a large number of samples with background music + subtitles, where the subtitles do not match the actual audio content.
• Lack of volume-based filtering or VAD (Voice Activity Detection) during training, causing non-speech segments to be paired with subtitle text.
• Because such noisy data likely appeared frequently, the model learned to output these fixed high-frequency phrases in non-speech conditions.
Suggestions for Improvement
• Apply volume-based filtering or VAD to clean the training data (an inference-time version of this gate is sketched after this list)
• Assign no_speech labels for non-speech segments instead of pairing them with subtitle text
• Explicitly label audio source types (human speech / background music / environmental noise) in the training data so the model can distinguish them properly
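The gating suggested in the first point can also be applied at inference time, skipping the model entirely on non-speech input. A sketch using the webrtcvad package; the aggressiveness mode, the 30 ms frame size, and the [NO_SPEECH] placeholder are illustrative choices:

```python
import numpy as np
import soundfile as sf
import webrtcvad

def contains_speech(path, mode=3, frame_ms=30):
    """True if any frame of the clip is classified as speech by WebRTC VAD."""
    audio, rate = sf.read(path, dtype="int16")       # VAD needs 16-bit PCM
    if audio.ndim > 1:
        audio = audio.mean(axis=1).astype(np.int16)  # mix down to mono
    vad = webrtcvad.Vad(mode)                        # 3 = most aggressive
    frame_len = int(rate * frame_ms / 1000)
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        if vad.is_speech(audio[i:i + frame_len].tobytes(), rate):
            return True
    return False

if contains_speech("input.wav"):
    pass  # transcribe with Whisper as usual
else:
    print("[NO_SPEECH]")  # skip the model entirely for non-speech input
```

Note that webrtcvad only accepts 16-bit mono PCM at 8, 16, 32, or 48 kHz, so audio at other sample rates would need resampling first.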