This repository was archived by the owner on Jul 17, 2025. It is now read-only.

Multilingual AVSR model decoding and training #16

@roudimit

Description

I downloaded the multilingual AVSR model (x_avsr) and tried to use the decoding script.
First, I ran into this error:

Traceback (most recent call last):
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 311, in hydra_main
    distributed_utils.call_main(cfg, main)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 96, in main
    return _main(cfg, h)
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 118, in _main
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 432, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/tasks/__init__.py", line 39, in setup_task
    cfg = merge_with_parent(dc(), cfg)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/dataclass/utils.py", line 490, in merge_with_parent
    merged_cfg = OmegaConf.merge(dc, cfg)
omegaconf.errors.ConfigKeyError: Key 'add_eos' not in 'AVHubertPretrainingConfig'
        full_key: add_eos
        reference_type=Optional[AVHubertPretrainingConfig]
        object_type=AVHubertPretrainingConfig

I fixed this by adding `add_eos: bool = field(default=False, metadata={"help": "hack: make the multilingual model work"})` at this line: https://github.com/facebookresearch/av_hubert/blob/e8a6d4202c208f1ec10f5d41a66a61f96d1c442f/avhubert/hubert_pretraining.py#L161
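For reference, the workaround boils down to adding one field to the task config dataclass so that `OmegaConf.merge` accepts checkpoints that carry this key. A minimal sketch on a stand-in dataclass (not the real `AVHubertPretrainingConfig`, which has many more fields):

```python
from dataclasses import dataclass, field

@dataclass
class AVHubertPretrainingConfigPatched:
    # stand-in for the config's existing fields
    data: str = field(default="", metadata={"help": "path to data directory"})
    # the added field: a key present in the checkpoint's saved task config but
    # absent from the dataclass makes OmegaConf.merge raise ConfigKeyError,
    # so the key must exist with a default
    add_eos: bool = field(
        default=False,
        metadata={"help": "hack: make the multilingual model work"},
    )
```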

I ran decoding on a few languages and noticed that the model outputs a language tag in the hypothesis (examples: `<fr> (Applaudissements)`, `<es> (Aplausos)`), while the reference doesn't contain the language tag.
My WERs were quite different from what's reported in the paper, but adding the language tag to the reference sentences makes them comparable to the paper's numbers (removing the tag from the hypothesis instead gave worse WER than reported). Just wanted to check: did you include the language tag in the references when evaluating in the multilingual setting?

The model sometimes outputs text in the wrong language (along with the incorrect language tag). Is there a way to force the output to be in a particular language?
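One possibility I'm aware of (an assumption, not something I've confirmed for x_avsr): fairseq's `SequenceGenerator` supports `prefix_tokens`, which forces decoding to begin with given tokens, so if the language tag is the first target token during training, passing its index as a prefix should pin the output language. A toy sketch of building that prefix (in fairseq it would be a `[batch, 1]` tensor of indices from `task.target_dictionary`, passed to `task.inference_step(generator, models, sample, prefix_tokens=...)`):

```python
# toy stand-in vocabulary; in fairseq this would come from task.target_dictionary
vocab = {"<s>": 0, "</s>": 2, "<fr>": 5, "<es>": 6}

def make_prefix_tokens(lang_tag, batch_size, vocab):
    """Build a [batch_size, 1] list of the language-tag index, analogous to
    the prefix_tokens tensor handed to the sequence generator."""
    idx = vocab[lang_tag]
    return [[idx] for _ in range(batch_size)]
```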

I was also wondering how to train the multilingual model (the current training script seems to cover only a single language). Specifically, should the language tag be prepended to every target sentence, and how do you balance samples across languages?
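For balancing, one common scheme in multilingual training is temperature-based resampling (whether MuAViC uses it is an assumption on my part): language l is drawn with probability proportional to p_l^(1/T), where p_l is its share of the data, and T > 1 upweights low-resource languages. Sketch:

```python
def sampling_probs(sizes, temperature=2.0):
    """Map per-language example counts to temperature-smoothed sampling
    probabilities: p_l ** (1/T), renormalized to sum to 1."""
    total = sum(sizes.values())
    weights = {l: (n / total) ** (1.0 / temperature) for l, n in sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}
```

For example, `sampling_probs({"en": 1000, "el": 10})` gives the low-resource language a larger share than its raw 10/1010 proportion.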
