This repository was archived by the owner on Jul 17, 2025. It is now read-only.

VSR performance lower on MuAViC version of LRS3 (En) #13

@roudimit

Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.

I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT, and ran this command:

python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
  dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
  common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
  override.modalities="['audio','video']" override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`

Using the AV-HuBERT version of LRS3:

  • 433h audio-visual: 1.486
  • 433h audio-only: 1.951
  • 433h video-only: 34.135

Using the MuAViC version of LRS3:

  • 433h audio-visual: 1.496 (slightly worse)
  • 433h audio-only: 1.951 (the same)
  • 433h video-only: 35.995 (noticeably worse)

It seems that the AV-HuBERT checkpoint performs worse on the MuAViC version of the data whenever video is involved.
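
To put a number on that gap, here is a minimal sketch that computes the absolute and relative difference per modality. It assumes the figures above are WER in percent; the dictionary and function names are just for illustration.

```python
# Values copied from the two lists above (AV-HuBERT 433h checkpoint).
# Assumption: they are WER in percent, so lower is better.
avhubert_version = {"audio-visual": 1.486, "audio-only": 1.951, "video-only": 34.135}
muavic_version = {"audio-visual": 1.496, "audio-only": 1.951, "video-only": 35.995}


def wer_gap(reference, other):
    """Return {modality: (absolute gap, relative gap in %)} of other vs. reference."""
    gaps = {}
    for modality, ref_wer in reference.items():
        absolute = other[modality] - ref_wer
        relative = 100.0 * absolute / ref_wer
        gaps[modality] = (round(absolute, 3), round(relative, 2))
    return gaps


gaps = wer_gap(avhubert_version, muavic_version)
for modality, (absolute, relative) in gaps.items():
    print(f"{modality}: {absolute:+.3f} absolute, {relative:+.2f}% relative")
```

For video-only this works out to a gap of 1.86 absolute, roughly 5.4% relative, while audio-only shows no gap at all.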

I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:

  • 433h audio-visual: 2.1941
  • 433h audio-only: 3.22
  • 433h video-only: 35.995

Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:

  • 433h audio-visual: 2.153 (slightly better)
  • 433h audio-only: 3.225 (the same)
  • 433h video-only: 34.459 (noticeably better)

The MuAViC checkpoint also gets better performance on the AV-HuBERT version of LRS3, which is somewhat surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays identical.
I also tried the other AV-HuBERT checkpoints and reached the same conclusion; the gap was more noticeable for the base models.
Did MuAViC process the LRS3 video differently from AV-HuBERT, and could that explain the difference in performance?
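
One cheap way to start checking this would be to compare per-clip video frame counts between the two versions. The sketch below is only that, a sketch: it assumes both versions have a tab-separated manifest whose first line is a root directory and whose later lines carry the clip id in the first column and the video frame count in the fourth, and that clip ids match across versions. None of that is confirmed here, and the helper names and demo data are made up.

```python
import csv
import io


def read_frame_counts(manifest_text):
    """Map clip id -> number of video frames from a manifest's text.

    Assumed layout (not confirmed in this issue): the first line is a root
    directory; every later line is tab-separated, with the clip id in
    column 0 and the video frame count in column 3.
    """
    rows = csv.reader(io.StringIO(manifest_text), delimiter="\t")
    next(rows, None)  # skip the root-directory line
    return {row[0]: int(row[3]) for row in rows if len(row) > 3}


def frame_count_mismatches(manifest_a, manifest_b):
    """Return {clip id: (frames in a, frames in b)} for shared clips that differ."""
    counts_a = read_frame_counts(manifest_a)
    counts_b = read_frame_counts(manifest_b)
    shared = counts_a.keys() & counts_b.keys()
    return {
        clip: (counts_a[clip], counts_b[clip])
        for clip in sorted(shared)
        if counts_a[clip] != counts_b[clip]
    }


# Tiny made-up manifests, for illustration only.
demo_a = "/root/a\nclip1\tv1.mp4\ta1.wav\t75\t48000\nclip2\tv2.mp4\ta2.wav\t50\t32000\n"
demo_b = "/root/b\nclip1\tv1.mp4\ta1.wav\t75\t48000\nclip2\tv2.mp4\ta2.wav\t49\t32000\n"
print(frame_count_mismatches(demo_a, demo_b))
```

Matching frame counts would not rule out other differences, such as a different mouth crop or re-encoding, so this only narrows things down.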
