I know it would be possible to train a detector for, say, sentiment in audio by using the output of Whisper's encoder as features for another neural network, presumably with supervision and labelled data.

The question is: would this be hamstrung from the start? Whisper is trained with large-scale weak supervision for transcription, so would that objective tend to suppress all information not related to the actual text transcript (tone, sentiment, background sounds, etc.)? Has anybody tried this?
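For concreteness, here is roughly what I have in mind, as a sketch only: it assumes the Hugging Face transformers Whisper classes and whisper-base, and the `SentimentProbe` head, its sizes, and the three-class labels are placeholders, not anything that exists.

```python
# Sketch: probe a frozen Whisper encoder's hidden states for sentiment.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder
encoder.eval()  # freeze Whisper; only the probe below would be trained


@torch.no_grad()
def encode(waveform, sampling_rate=16000):
    """Return encoder hidden states for one 16 kHz waveform."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    # Shape for whisper-base: (batch=1, frames=1500, d_model=512)
    return encoder(inputs.input_features).last_hidden_state


class SentimentProbe(nn.Module):
    """Hypothetical linear probe: mean-pool over time, then classify."""

    def __init__(self, d_model=512, num_classes=3):  # e.g. negative / neutral / positive
        super().__init__()
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, hidden_states):
        return self.head(hidden_states.mean(dim=1))


probe = SentimentProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a labelled example (waveform, label):
#   logits = probe(encode(waveform))
#   loss = loss_fn(logits, torch.tensor([label]))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

The point being that the encoder stays frozen, so whatever the probe can learn is an upper bound on how much non-transcript information survives in those features.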
Replies: 1 comment

Yes, take a look at the Qwen-Audio model. It uses the encoder of Whisper as the audio input to the Qwen LLM. From the paper: "Although Whisper is supervised trained for speech recognition …"
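If you want to try it, the chat variant can be run through transformers with trust_remote_code. Roughly like this, adapted from the Qwen-Audio repo's quickstart (check the current README for the exact API; the audio path and prompt here are placeholders):

```python
# Sketch of Qwen-Audio-Chat inference via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# The custom tokenizer interleaves audio and text into a single prompt.
query = tokenizer.from_list_format([
    {"audio": "example.wav"},  # placeholder path
    {"text": "What does the person say, and how do they sound?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

Notably, its pretraining covers tasks beyond transcription (emotion recognition, sound classification, etc.), which suggests the Whisper encoder features do retain that kind of information.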