🚀 Introducing Whisper-Flamingo, an audio-visual speech recognition and translation model! #2231

roudimit · 2024-06-17T14:00:25Z

Excited to introduce Whisper-Flamingo! Check out the video demo below; Whisper-Flamingo can transcribe and translate speech with heavy background noise!

Paper: arXiv (https://arxiv.org/pdf/2406.10082)
Code, pre-trained models, Colab: GitHub
2m demo comparing Whisper and Whisper-Flamingo: YouTube link
10m presentation: YouTube link

Whisper-Flamingo.teaser.mp4

We convert Whisper into an audio-visual speech recognition model so that it can use both audio and lip-based video as input.
Our audio-visual Whisper-Flamingo significantly outperforms the audio-only Whisper model when tested on noisy audio.
Our models transcribe English speech and translate English speech into 6 languages: Greek, Spanish, French, Italian, Portuguese, and Russian.
We are releasing our audio-visual models in three sizes (Large, Medium, Small), as well as the audio-only models fine-tuned on noisy audio.

Key Methods

Stage 1: Make the audio-only Whisper model more noise-robust. We fine-tune audio-only Whisper while adding noise to the audio.
Stage 2: Inject the visual modality into Whisper. We freeze audio-only Whisper, add new trainable cross-attention layers into Whisper's decoder attending to visual features from AV-HuBERT, and train the model on audio-visual inputs.
We enable En-X translation by training on English audio paired with English text and translations in 6 languages.
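
To make Stage 2 a bit more concrete, here is a minimal pure-Python sketch of a gated cross-attention step. It is an illustration under assumptions, not the released implementation: the Flamingo-style tanh gate starting at zero, the single attention head without learned projections, and the names `cross_attention`, `gated_cross_attention`, and `gate` are all simplifications chosen for this example. The real layers are in the GitHub repo.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_feats):
    """Single-head attention: each decoder state queries the visual features.

    Assumption for brevity: queries, keys, and values are used directly,
    with no learned projection matrices.
    """
    dim = len(visual_feats[0])
    attended = []
    for query in text_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_feats
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, visual_feats)) for i in range(dim)]
        )
    return attended


def gated_cross_attention(text_states, visual_feats, gate):
    """Residual update: x + tanh(gate) * cross_attention(x, visual).

    `gate` is a hypothetical learnable scalar. At gate = 0 the visual
    branch contributes nothing and the input passes through unchanged.
    """
    attended = cross_attention(text_states, visual_feats)
    g = math.tanh(gate)
    return [
        [x + g * a for x, a in zip(state, att)]
        for state, att in zip(text_states, attended)
    ]


# Toy data: 2 decoder states and 3 visual frames, each of dimension 2.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
lip_features = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = gated_cross_attention(decoder_states, lip_features, gate=0.0)
mixed = gated_cross_attention(decoder_states, lip_features, gate=1.0)
```

With `gate=0.0` the output equals the input, so a model built this way starts out behaving like the frozen audio-only decoder, and training can then open the gate to let visual information in.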

Let me know if you have any comments or questions!

ngcheeyuan · 2024-06-26T12:50:04Z

Excellent work. Some questions.

What kind of videos can it handle? Must something be done to the video to isolate the lips to do the lip reading?
What's your experience with ASR on noisy audio? Does artificially added noise correlate strongly with real noise?
Does making Whisper more robust in English transcription make it more robust in other languages?
If I would like to fine-tune Whisper to make it more robust for other languages, can I randomly add different noises observed in real-world settings?

Thank you for your time.

roudimit Jun 30, 2024
Author

Thank you for your interest!

It will work best on videos where the person is facing the camera and the lips are clearly visible without obstructions. If the person turns their head too far to the left or right, the lips become less visible, which makes it harder for the model to read them. Yes, a preprocessing step is required: the video is cropped to the lips and the angle of the lips is normalized. We provide a Google Colab where you can try the model on your own video, and it implements the preprocessing step: https://colab.research.google.com/drive/1rnhNOZuUxh-WXXloo_z1fu5DKeJrH95p
In this work, we added noise during training from the MUSAN dataset, which consists of real-world noise from categories like “natural”, “music”, and “babble”. During testing, we mixed the clean audio from LRS3 with babble noise that we generated. This simulates audio recorded in noisy conditions and allows us to control the signal-to-noise ratio. This should correlate well with real-world noise conditions. If you know the types of noise the model will be tested on, it's always best to train with that kind of noise.
The current model only performs transcription and En-X translation for English audio / video. The audio-visual model did achieve more robust En-X translation. However, future work can try to improve robustness for audio / video inputs in other languages.
Yes, you can. We provide the code for this (see Step 1: Fine-tune audio-only Whisper for En-X translation on MuAViC). Fine-tuning an ASR model with noise can lead to much better performance in noisy conditions. However, there is a tradeoff - the model becomes slightly worse in clean conditions. You can check Table 3 of our paper (https://arxiv.org/pdf/2406.10082). Fine-tuning Whisper with noise added to LRS3 audio improves WER in noisy conditions from 20.8% to 11.7%; however, the WER on clean audio gets slightly worse, from 2.1% to 2.3%.
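
For anyone who wants to try adding noise at a controlled signal-to-noise ratio, the idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code from the repo: the helper names `power` and `mix_at_snr` are made up for this example, the sine wave and Gaussian noise stand in for real speech and MUSAN or babble recordings, and the 0 dB target is an arbitrary choice.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean power / noise power matches `snr_db`, then add it.

    Returns the noisy mixture and the scaled noise (useful for checking the SNR).
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[: len(clean)]
    # SNR_dB = 10 * log10(P_clean / P_noise)  =>  P_noise = P_clean / 10^(SNR_dB / 10)
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# Stand-in signals: one second at 16 kHz (placeholders, not real data).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=0.0)
measured_snr_db = 10 * math.log10(power(clean) / power(scaled_noise))
```

During fine-tuning, one would typically draw a different noise clip (and possibly a different SNR) for each training example, so the model sees a range of conditions rather than a single fixed mixture.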

Feel free to ask more questions.
