The Large V2 model of Whisper incorrectly predicts YouTube-related phrases, such as "don't forget to subscribe and like the video" or "subscribe to our channel", in some audio recordings, especially those with silence at the end. These sentences are not present in the actual audio content. #1101
Replies: 5 comments
- I can confirm this was also an issue with v1; see, for example: #29 (comment)
- See also these related discussions, just FYI.
- Had a similar issue where one of my recordings started with completely indiscernible background chatter in a conference room, and Whisper inserted "@2013 Mooji Media Lt. All Rights Reserved".
- Confirmed. I've seen it in Polish as well.
- Same for Russian. During silence it produces random words, not necessarily in Russian: they can be Korean, English, or strings of random letters. Mostly it outputs something like "Don't forget to subscribe" or "Subtitles were made by... (some random name)". It also labels music, e.g. "Calm music".
- I have been working with the Whisper Large V2 model for transcribing audio recordings. In several cases, particularly when there is silence at the end of the audio, the model produces YouTube-related phrases that are not actually part of the content; the problem seems to occur only at the end of the transcript.
Examples of incorrectly predicted phrases include "don't forget to subscribe and like the video" and "subscribe to our channel".
This might be related to the large amount of YouTube-sourced captioned audio in the training data.
Steps to reproduce
Expected behavior
The model should not predict YouTube-related phrases when they are not present in the audio content, even when there is silence at the end of the recording.
Actual behavior
The model incorrectly predicts YouTube-related phrases that are not part of the audio content, particularly when there is silence at the end of the recording.
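In practice this behavior can often be detected after the fact: each segment returned by openai-whisper's model.transcribe() carries a no_speech_prob and an avg_logprob field, and hallucinated tails tend to combine a high no-speech probability with low decoding confidence. Below is a minimal post-filter sketch; the function name drop_probable_hallucinations and the threshold values are my own illustrative choices, loosely mirroring Whisper's default heuristic.

```python
# Post-processing sketch: drop segments that Whisper itself flags as
# likely non-speech. Assumes the segment dicts produced by
# openai-whisper's model.transcribe(), which include "no_speech_prob"
# and "avg_logprob" fields. Thresholds are illustrative guesses.

def drop_probable_hallucinations(segments,
                                 no_speech_threshold=0.6,
                                 logprob_threshold=-1.0):
    """Keep only segments the decoder is reasonably confident contain speech."""
    kept = []
    for seg in segments:
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        # Whisper's own heuristic treats "probably silence AND
        # low-confidence text" as a no-speech segment; mirror that here,
        # so confident text over quiet audio is still kept.
        if likely_silence and low_confidence:
            continue
        kept.append(seg)
    return kept
```

Note the AND: a segment is only dropped when both signals agree, which avoids discarding genuine quiet speech that the decoder is still confident about.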
Possible solution
A potential solution could involve re-examining the training data to reduce the influence of such YouTube-related phrases, or refining the model to be more context-aware when predicting sentences in silent sections of the audio.
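As a workaround on the input side, trimming the silent tail before transcription avoids feeding the model the silence that triggers these hallucinations. Here is a pure-Python sketch over 16-bit mono PCM samples; the frame size and RMS threshold are illustrative assumptions, and a proper VAD would be more robust in production.

```python
# Illustrative sketch: trim trailing silence from 16-bit mono PCM samples
# before handing the audio to Whisper, so the model never sees a long
# silent tail. The sample rate, frame length, and RMS threshold below
# are assumptions, not values from Whisper itself.

def trim_trailing_silence(samples, sample_rate=16000,
                          frame_ms=30, rms_threshold=200):
    """Return samples with any silent tail removed.

    samples: list of signed 16-bit ints (mono PCM).
    A frame counts as silent when its RMS falls below rms_threshold.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    end = len(samples)
    # Walk backwards frame by frame until we hit a non-silent frame.
    while end > 0:
        frame = samples[max(0, end - frame_len):end]
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        if rms >= rms_threshold:
            break
        end -= frame_len
    return samples[:max(0, end)]
```

The trimmed sample list can then be converted to float32 in [-1, 1] and passed to transcription as usual; an entirely silent recording trims down to an empty list, which is a useful signal to skip transcription altogether.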
Additional context
This issue has been observed only in the Large V2 model and primarily occurs at the end of the audio content.