🚀 Introducing mWhisper-Flamingo, a multilingual audio-visual speech recognition model! #2519
roudimit started this conversation in Show and tell
Excited to introduce mWhisper-Flamingo! Check out the demo video; mWhisper-Flamingo can transcribe multilingual speech with heavy background noise!
mWhisper-Flamingo.demo.v2.mp4
mWhisper-Flamingo is the multilingual follow-up to Whisper-Flamingo, which converts Whisper into an audio-visual speech recognition (AVSR) model but was only trained and tested on English videos.
We trained mWhisper-Flamingo on videos in 9 languages, and it outperforms audio-only Whisper on noisy audio!
Whisper-Flamingo's default training setup yielded only minor improvements in noisy multilingual WER, despite the significant improvements it delivered for English.
To fix this, we introduce decoder modality dropout: during training, the model sees paired audio-visual inputs on some steps and a single modality on others.
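The dropout scheme above can be sketched as a small training-time helper. This is a minimal illustration, not our exact implementation: the function name, probabilities, and the choice to pass `None` for a dropped modality are all assumptions for the sake of the example.

```python
import random

def modality_dropout(audio_feats, video_feats, p_paired=0.5, p_audio=0.25):
    """Hypothetical sketch of decoder modality dropout.

    With probability p_paired the decoder receives both modalities;
    otherwise it receives audio-only (p_audio) or video-only (the rest).
    A dropped modality is represented here as None; a real implementation
    might instead zero out or mask those features.
    """
    r = random.random()
    if r < p_paired:
        return audio_feats, video_feats   # paired audio-visual input
    elif r < p_paired + p_audio:
        return audio_feats, None          # audio-only input
    else:
        return None, video_feats          # video-only input
```

Training on all three input configurations encourages the decoder to use each modality on its own, rather than relying on the audio stream and ignoring the visual one.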
We are releasing our audio-visual models in two sizes (Medium and Small), as well as the audio-only models fine-tuned on noisy audio.
Let me know if you have any comments or questions!