🚀 Introducing mWhisper-Flamingo, a multilingual audio-visual speech recognition model! #2519
roudimit started this conversation in Show and tell
Excited to introduce mWhisper-Flamingo! Check out the demo video; mWhisper-Flamingo can transcribe multilingual speech with heavy background noise!
mWhisper-Flamingo.demo.v2.mp4
mWhisper-Flamingo is the multilingual follow-up to Whisper-Flamingo, which converts Whisper into an audio-visual speech recognition (AVSR) model but was only trained and tested on English videos.
We trained mWhisper-Flamingo on videos in 9 languages, and it outperforms audio-only Whisper on noisy audio!
Whisper-Flamingo's default training setup yielded only minor improvements in noisy multilingual WER, despite the significant improvements it delivered for English.
To fix this, we introduce decoder modality dropout: during training, the model sees paired audio-visual inputs on some steps and a single modality on others.
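The dropout scheme above can be sketched as a small training-time helper. This is a minimal illustration, not our exact implementation: the function name, probabilities, and the choice to pass `None` for a dropped modality are all assumptions for the sake of the example.

```python
import random

def modality_dropout(audio_feats, video_feats, p_paired=0.5, p_audio=0.25):
    """Hypothetical sketch of decoder modality dropout.

    With probability p_paired the decoder receives both modalities;
    otherwise it receives audio-only (p_audio) or video-only (the rest).
    A dropped modality is represented here as None; a real implementation
    might instead zero out or mask those features.
    """
    r = random.random()
    if r < p_paired:
        return audio_feats, video_feats   # paired audio-visual input
    elif r < p_paired + p_audio:
        return audio_feats, None          # audio-only input
    else:
        return None, video_feats          # video-only input
```

Training on all three input configurations encourages the decoder to use each modality on its own, rather than relying on the audio stream and ignoring the visual one.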
We are releasing our audio-visual models in two sizes (Medium and Small), as well as the audio-only models fine-tuned on noisy audio.
Let me know if you have any comments or questions!