VSR performance lower on MuAViC version of LRS3 (En)

Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.

I first tried `ckpt=large_noise_pt_noise_ft_433h.pt` from AV-HuBERT, and ran this command:

```
python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
  dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
  common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
  override.modalities="['audio','video']" override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`
```

Using the AV-HuBERT version of LRS3:
- 433h audio-visual: 1.486
- 433h audio-only: 1.951
- 433h video-only: 34.135

Using the MuAViC version of LRS3:
- 433h audio-visual: 1.496 (slightly worse)
- 433h audio-only: 1.951 (the same)
- 433h video-only: 35.995 (noticeably worse)

So the AV-HuBERT checkpoint performs worse on the MuAViC version of the data whenever video is involved, while the audio-only result is unchanged.

I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:
- 433h audio-visual: 2.1941
- 433h audio-only: 3.22
- 433h video-only: 35.995

Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:
- 433h audio-visual: 2.153 (slightly better)
- 433h audio-only: 3.225 (the same)
- 433h video-only: 34.459 (noticeably better)
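
To make the gaps easier to read, here is a small sketch that computes the absolute and relative difference between the two dataset versions for each checkpoint and modality. The numbers are copied from the lists above; I am assuming they are all error rates in percent where lower is better, and the dictionary keys (`av_hubert_data`, `muavic_data`) are just names I made up for this example.

```python
# Numbers copied from the results above (assumed to be error rates in %,
# lower is better). The key names are illustrative only.
results = {
    "AV-HuBERT checkpoint": {
        "audio-visual": {"av_hubert_data": 1.486, "muavic_data": 1.496},
        "audio-only": {"av_hubert_data": 1.951, "muavic_data": 1.951},
        "video-only": {"av_hubert_data": 34.135, "muavic_data": 35.995},
    },
    "MuAViC checkpoint": {
        "audio-visual": {"av_hubert_data": 2.153, "muavic_data": 2.1941},
        "audio-only": {"av_hubert_data": 3.225, "muavic_data": 3.22},
        "video-only": {"av_hubert_data": 34.459, "muavic_data": 35.995},
    },
}


def gaps(results):
    """Return (checkpoint, modality, absolute gap, relative gap in %) rows.

    A positive gap means the MuAViC version of LRS3 scores worse.
    """
    rows = []
    for ckpt, by_modality in results.items():
        for modality, scores in by_modality.items():
            absolute = scores["muavic_data"] - scores["av_hubert_data"]
            relative = 100.0 * absolute / scores["av_hubert_data"]
            rows.append((ckpt, modality, absolute, relative))
    return rows


for ckpt, modality, absolute, relative in gaps(results):
    print(f"{ckpt:22s} {modality:13s} {absolute:+.3f} abs  {relative:+.2f}% rel")
```

The video-only rows show the largest absolute gap for both checkpoints, and the audio-only rows are at or near zero.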

The MuAViC checkpoint also performs better on the AV-HuBERT version of LRS3, which is somewhat surprising. With either checkpoint (AV-HuBERT or MuAViC), the audio-only performance is essentially identical across the two dataset versions; only the results that involve video change.
I also tried the other AV-HuBERT checkpoints and reached the same conclusion (the gap was even more noticeable for the base models).
Did MuAViC process the LRS3 video differently from AV-HuBERT? That could explain the difference in performance.
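
As a first check on my side, something like the sketch below could confirm whether the processed video files differ at all between the two versions. It assumes both processed trees use the same relative file paths and `.mp4` files, which may not hold for either pipeline; the function names and the example paths are hypothetical. A byte-level difference only shows that the files are not identical (a re-encode alone would cause that), so it does not say how the cropping or preprocessing differs.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large videos are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_video_trees(root_a, root_b, pattern="*.mp4"):
    """Compare two directory trees of processed videos by relative path.

    Assumption: both trees store the same clip under the same relative path.
    """
    root_a, root_b = Path(root_a), Path(root_b)
    files_a = {p.relative_to(root_a) for p in root_a.rglob(pattern)}
    files_b = {p.relative_to(root_b) for p in root_b.rglob(pattern)}

    identical, differing = [], []
    for rel in sorted(files_a & files_b):
        same = file_digest(root_a / rel) == file_digest(root_b / rel)
        (identical if same else differing).append(rel)

    return {
        "only_in_a": sorted(files_a - files_b),
        "only_in_b": sorted(files_b - files_a),
        "identical": identical,
        "differing": differing,
    }


# Example call with hypothetical paths:
# report = compare_video_trees("/data/lrs3_avhubert/video", "/data/muavic/en/video")
# print({name: len(paths) for name, paths in report.items()})
```

If most shared clips end up in `differing`, the next step would be to compare frame size and mouth-crop position for a few of them.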
