I know it would be possible to train a detector for, say, sentiment in audio by using the output of Whisper's encoder as features for another neural network, presumably with supervision and labelled data.

The question is: would this be hamstrung from the start? Whisper is trained with large-scale weak supervision for transcription, so would that objective tend to suppress all information not related to the actual text transcript (tone, sentiment, background sounds, etc.)? Has anybody tried this?
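For concreteness, here is roughly what I have in mind, as a sketch only: it assumes the Hugging Face transformers Whisper classes and whisper-base, and the `SentimentProbe` head, its sizes, and the three-class labels are placeholders, not anything that exists.

```python
# Sketch: probe a frozen Whisper encoder's hidden states for sentiment.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder
encoder.eval()  # freeze Whisper; only the probe below would be trained


@torch.no_grad()
def encode(waveform, sampling_rate=16000):
    """Return encoder hidden states for one 16 kHz waveform."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    # Shape for whisper-base: (batch=1, frames=1500, d_model=512)
    return encoder(inputs.input_features).last_hidden_state


class SentimentProbe(nn.Module):
    """Hypothetical linear probe: mean-pool over time, then classify."""

    def __init__(self, d_model=512, num_classes=3):  # e.g. negative / neutral / positive
        super().__init__()
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, hidden_states):
        return self.head(hidden_states.mean(dim=1))


probe = SentimentProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a labelled example (waveform, label):
#   logits = probe(encode(waveform))
#   loss = loss_fn(logits, torch.tensor([label]))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

The point being that the encoder stays frozen, so whatever the probe can learn is an upper bound on how much non-transcript information survives in those features.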
Replies: 1 comment

Yes, take a look at the Qwen-Audio model. It uses the encoder of Whisper as the audio input to the Qwen LLM. From the paper: "Although Whisper is supervised trained for speech recognition …"
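If you want to try it, the chat variant can be run through transformers with trust_remote_code. Roughly like this, adapted from the Qwen-Audio repo's quickstart (check the current README for the exact API; the audio path and prompt here are placeholders):

```python
# Sketch of Qwen-Audio-Chat inference via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# The custom tokenizer interleaves audio and text into a single prompt.
query = tokenizer.from_list_format([
    {"audio": "example.wav"},  # placeholder path
    {"text": "What does the person say, and how do they sound?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

Notably, its pretraining covers tasks beyond transcription (emotion recognition, sound classification, etc.), which suggests the Whisper encoder features do retain that kind of information.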