🔉 UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

Links: arXiv · GitHub.io project page · Hugging Face Spaces · Hugging Face Model

This repository is the official implementation of the paper "UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities". It provides a lightweight training framework built on Accelerate, which can be customized easily since all of the training-loop code is exposed in trainer.py.

💡 Inference using Pre-trained Models

Dependency Installation

First, please install dependencies required for training and inference.

conda create -n uniflow-audio python=3.10

If you want to perform text-to-speech (TTS) synthesis inference, you also need to install montreal-forced-aligner, so execute the following command instead:

conda create -n uniflow-audio -c conda-forge python=3.10 montreal-forced-aligner

Then install the Python dependencies:

conda activate uniflow-audio
pip install -r requirements.txt

Optional Dependencies for TTS Inference

To extract speaker embeddings for TTS inference, you need to install wespeaker:

pip install git+https://github.com/wenet-e2e/wespeaker.git

Optional Dependencies for V2A Inference

To perform video-to-audio (V2A) generation inference, please install the following additional libraries:

pip install moviepy av torchvision

Running Inference

Please refer to INFERENCE_CLI.md for inference CLI examples.

🛠️ Training

Data Format

For each generation dataset, the input content information should be organized in a content.jsonl file. Each line of content.jsonl looks like:

{"audio_id": "xxx", "caption": "xxx"}

The target audio files should be organized in an audio.jsonl file with a similar format:

{"audio_id": "xxx", "audio": "/path/to/audio/file"}

Then, for each task type, implement a dataset class that inherits from AudioGenerationDataset in data_module/dataset.py; this is where the content loading method is defined.
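
As a rough, hypothetical sketch only (the real base-class interface and hook names are defined in data_module/dataset.py and may differ), a task-specific dataset might look like this:

from data_module.dataset import AudioGenerationDataset

class TextToAudioDataset(AudioGenerationDataset):
    """Hypothetical example; see the existing subclasses in data_module/dataset.py for the real hooks."""

    def load_content(self, item):
        # Assumption: `item` is one parsed line of content.jsonl; return the conditioning content.
        return item["caption"]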

For the datasets used in the paper, our pre-processing scripts are in data_preprocess. You may use them as a reference for processing your own data.

Configurations

We use hydra + omegaconf to organize training configurations.

  • hydra organizes the configuration into separate modules via the defaults list and supports command-line overrides. See the hydra docs and the examples in configs.
  • omegaconf supports custom resolvers in addition to native variable interpolation, so fields in YAML can be set more dynamically; a small standalone illustration follows this list. See the omegaconf docs for more details.
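
For illustration, here is a small standalone snippet (not tied to this repo's configs) showing variable interpolation and a custom resolver with omegaconf; the field names and the mul resolver are made up for this example:

from omegaconf import OmegaConf

# Hypothetical resolver: multiply two numbers inside a YAML config.
OmegaConf.register_new_resolver("mul", lambda a, b: a * b)

cfg = OmegaConf.create("""
base_lr: 1.0e-4
num_gpus: 8
exp_name: run_${num_gpus}gpu          # native interpolation: reuse another field
lr: ${mul:${base_lr},${num_gpus}}     # custom resolver: scale base_lr by num_gpus
""")
print(cfg.exp_name)  # run_8gpu
print(cfg.lr)        # base_lr scaled by num_gpus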

Hydra Override Examples

Here are some hydra override examples:

Example 1
python inference.py +data_dict.audiocaps.test.max_samples=100

This sets the maximum number of samples for the test split of the audiocaps dataset to 100.

Example 2
accelerate launch train.py \
  model/backbone=input_fusion_dit

This uses input_fusion_dit instead of the default layer_fusion_dit. It is an example of overriding a config group that is not at the top level.

Customize Training

Like pytorch-lightning, this framework adds a thin abstraction over the native PyTorch training loop, making it easier to train new models, datasets, and loss functions. Most of the effort lies in implementing these components and writing the corresponding YAML configs:

  1. Implement datasets, models, loss functions, etc.: this is the same as in a normal PyTorch-based training pipeline.
  2. Implement a custom trainer: similar to LightningModule in pytorch-lightning, we define a number of hooks in the training loop. To customize the training process, at a minimum we only need to define the behavior of training_step and validation_step; other hooks, such as on_train_start and on_validation_start, can also be customized (a hedged sketch follows this list). audio_generation_trainer.py gives an example.
  3. Write YAML files: the YAML configs need to wire up the dataset, model, ..., and trainer defined above. Among them, "train_dataloader", "val_dataloader", "optimizer", "lr_scheduler" and "loss_fn" must be specified.
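
As a hedged sketch (the base class name and the attributes used below are assumptions; see audio_generation_trainer.py for the actual interface), a custom trainer could look like:

from trainer import Trainer  # assumption: the base class exposed in trainer.py

class MyTrainer(Trainer):
    """Hypothetical trainer; only training_step and validation_step are defined here."""

    def training_step(self, batch):
        # Compute and return the training loss for one batch.
        output = self.model(**batch)        # assumption: the model is stored on self.model
        return self.loss_fn(output, batch)  # assumption: loss_fn comes from the YAML config

    def validation_step(self, batch):
        # Mirrors training_step; returned values are aggregated for validation logging.
        output = self.model(**batch)
        return self.loss_fn(output, batch)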

The YAML format is hydra-style, for example:

object:
  _target_: module.submodule.Class
  param1: value1
  param2: value2
  sub_object:
    _target_: module.submodule.SubClass
    param1: value1
    param2: value2

The object will be instantiated recursively.
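
For reference, such a config is typically turned into live objects with hydra.utils.instantiate; here is a minimal standalone illustration using a stdlib class in place of a model or dataset class:

from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "_target_": "datetime.timedelta",  # any importable class or callable
    "hours": 1,
    "minutes": 30,
})
obj = instantiate(cfg)
print(obj)  # 1:30:00
# Nested mappings that also contain _target_ are instantiated first and then
# passed to the outer constructor, which is what "instantiated recursively" means.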

Launch Training

Training is launched with the accelerate command line tool:

accelerate launch train.py
# or
accelerate launch train.py --config-path path/to/config/dir --config-name conf 

This will use path/to/config/dir/conf.yaml as the configuration entry point, and ${HF_HOME}/accelerate/default_config.yaml as the accelerate configuration.

Command line overrides are still supported:

accelerate launch --config_file configs/accelerate/nvidia/8gpus.yaml train.py \
    warmup_params.warmup_steps=500 \
    train_dataloader.batch_size=12 \
    val_dataloader.batch_size=12 \
    epochs=100

Inference

After training, experiment logging files, checkpoints, and other artifacts are saved in ${exp_dir}, which is defined in configs/train.yaml. We also use accelerate for inference:

exp_dir="/path/to/exp_dir"
ckpt_dir="/path/to/exp_dir/checkpoints/epoch_xxx"
accelerate launch \
  inference.py \
  data@data_dict=t2a_audiocaps \
  exp_dir=${exp_dir} \
  ckpt_dir_or_file=${ckpt_dir}

This will run inference on the AudioCaps test set with the default configuration in configs/inference.yaml.

📊 Evaluation

For evaluation, please refer to EVALUATION.md.

πŸ“ TODO

  • Add inference script for pre-trained models.
  • Add README about evaluation guidance.
  • Add interactive inference interface link.

📖 Citation

If you find the paper or the codebase useful, please consider citing:

@article{xu2025uniflow,
  title={UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities},
  author={Xu, Xuenan and Mei, Jiahao and Zheng, Zihao and Tao, Ye and Xie, Zeyu and Zhang, Yaoyun and Liu, Haohe and Wu, Yuning and Yan, Ming and Wu, Wen and Zhang, Chao and Wu, Mengyue},
  journal={arXiv preprint arXiv:2509.24391},
  year={2025}
}

✨ Acknowledgements

We would like to express our gratitude to the following projects and their contributors, from which we have borrowed code or drawn inspiration:

We appreciate the open-source community for making these valuable resources available.
