Replies: 2 comments
-
It's the hallucination problem (#679); the cause is, more or less as you said, the training data (#928). For now the workaround is to extract vocals using source separation models like Spleeter or Demucs. If you want more accurate sound classification, it requires fine-tuning.
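As a minimal sketch of the workaround above, assuming the Demucs CLI is installed (`pip install demucs`): build the two-stem separation command, run it, then feed the resulting vocals stem to Whisper. The output-path layout and default model name are assumptions about recent Demucs releases; check your install.

```python
# Sketch: isolate vocals with Demucs before transcription, so Whisper
# never sees the instrumental bed it tends to hallucinate over.
import subprocess
from pathlib import Path


def demucs_command(audio_path: str, out_dir: str = "separated") -> list[str]:
    # --two-stems vocals splits the file into vocals + accompaniment only.
    return ["demucs", "--two-stems", "vocals", "-o", out_dir, audio_path]


def extract_vocals(audio_path: str, out_dir: str = "separated") -> Path:
    """Run Demucs and return the expected path of the vocals stem."""
    subprocess.run(demucs_command(audio_path, out_dir), check=True)
    # Assumed layout: <out_dir>/<model>/<track>/vocals.wav, with the
    # default model named "htdemucs" in recent releases.
    return Path(out_dir) / "htdemucs" / Path(audio_path).stem / "vocals.wav"
```

You would then transcribe `vocals.wav` instead of the original mix.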
-
Update: you can try https://github.com/YuanGongND/whisper-at
-
Just a quick signpost, having run this across a few thousand hours of mixed material: whenever orchestral instrumental music is present, Whisper has a strong bias toward inferring incorrect titles for a handful of works, especially Edward Elgar's "Pomp and Circumstance." I assume there are some graduation ceremonies mixed into the training set, as the hallucinations include a fair number of variations on the piece: different movements, keys, and named attributions to Elgar.
These have been my biggest offenders, but there are hundreds more, YMMV. They are typically (but not always) identifiable in a post-process regex search for text surrounded by brackets.
A generic (music playing) sound-atmosphere normalizer would be a super helpful addition, and/or, a boy can dream, a genre recognizer: (jazz music playing), (orchestral music playing), (vocal music playing). In the same general space, inclusion of non-verbal atmospherics such as (phone ringing), (siren), and (car horn), for which I assume there must be data floating around, would be a welcome option as well.
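The post-process regex search described above can be sketched as follows. The pattern and the suspect-title list are illustrative assumptions, not an exhaustive catalogue of what Whisper hallucinates.

```python
# Sketch: flag likely hallucinated music annotations in a Whisper
# transcript by scanning bracketed/parenthesised spans for known
# problem titles (list here is illustrative, extend as needed).
import re

SUSPECT_TITLES = ["Pomp and Circumstance", "Elgar"]

# Matches the shortest span inside (...) or [...].
BRACKETED = re.compile(r"[\[\(]([^\]\)]+)[\]\)]")


def flag_suspect_spans(transcript: str) -> list[str]:
    """Return bracketed spans that mention a known-problem title."""
    hits = []
    for match in BRACKETED.finditer(transcript):
        span = match.group(1)
        if any(title.lower() in span.lower() for title in SUSPECT_TITLES):
            hits.append(span)
    return hits
```

Spans like `[music playing]` pass through untouched; only spans naming a suspect work are flagged for review or replacement with a generic marker.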