[HomePage]
[ArXiv Paper]
[HF Paper]
[Model]
Under the JavisVerse project, this repo presents two versions of the JavisDiT series:
- JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization (a.k.a. JavisDiT-v0.1)
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation (a.k.a. JavisDiT-v1.0)
The main branch maintains JavisDiT++, a more concise yet powerful DiT model to generate semantically and temporally aligned sounding videos with textual conditions. Auxiliary support of JavisDiT (v0.1) can be found at assets/docs/JavisDiT.md.
JavisDiT++.Demo.mp4
- [2026.02.26] 🔥🔥 JavisDiT++ are also integrated into JavisGPT!
- [2026.02.26] 🔥🔥 This repository has upgraded from JavisDiT to JavisDiT++, both of which are accepted by ICLR 2026. For the initial version, please refer to assets/docs/JavisDiT.md.
- [2025.12.26] 🚀 JavisDiT and JavisGPT are integrated into the JavisVerse project. We hope to contribute to the Joint Audio-Video Intelligence Symphony (Javis) in the community.
- [2025.12.26] 🚀 We released JavisGPT, a unified multi-modal LLM for sounding-video comprehension and generation. For more details refer to this repo.
- [2025.08.11] 🔥 We released the data and code for JAVG evaluation. For more details refer to eval/javisbench/README.md.
- [2025.04.15] 🔥 We released the data preparation and model training instructions. You can train JavisDiT with your own dataset!
- [2025.04.07] 🔥 We released the inference code and a preview model of JavisDiT-v0.1 at HuggingFace.
- [2025.04.03] We release the repository of JavisDiT. Code, model, and data are coming soon.
- Release the data and evaluation code for JavisScore.
JavisDiT++ addresses the key bottleneck of JAVG with a unified perspective of modeling and optimization.
- We model JAVG via joint self-attention to enable dense inter-modal interaction, with modality-specific MoE (MS-MoE) design to refine intra-modal representation.
- We propose a temporally aligned rotary position encoding (TA-RoPE) scheme to ensure explicit and fine-grained audio-video token synchronization.
- We devise the AV-DPO technique to consistently improve audio-video quality and synchronization by aligning generation with human preferences.
We hope to set a new standard for the JAVG community. For more technical details, kindly refer to the original paper.
For CUDA 12.1, you can install the dependencies with the following commands.
# create a virtual env and activate (conda as an example)
conda create -n javisdit python=3.10
conda activate javisdit
# download the repo
git clone https://github.com/JavisVerse/JavisDiT
cd JavisDiT
# install torch, torchvision and xformers
pip install -r requirements/requirements-cu121.txt
# install ffpmeg
conda install -c conda-forge ffmpeg -y
# the default installation is for inference only
pip install -v .
# for development mode, `pip install -v -e .`
# to skip dependencies, `pip install -v -e . --no-deps`
pip install flash-attn --no-build-isolation
# replace
PYTHON_SITE_PACKAGES=$(python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")
cp assets/src/pytorchvideo_augmentations.py ${PYTHON_SITE_PACKAGES}/pytorchvideo/transforms/augmentations.py
cp assets/src/funasr_utils_load_utils.py ${PYTHON_SITE_PACKAGES}/funasr/utils/load_utils.py| Version | Base Model | Resolution | Duration | Model Size |
|---|---|---|---|---|
| JavisDiT-v1.0 | Wan2.1-1.3B | 240P-480P | 2s-5s | 2.1B |
Run the following command to download the pre-trained weights.
# download JavisDiT weights
hf download JavisVerse/JavisDiT-v1.0-jav --local-dir ./checkpoints/JavisDiT-v1.0-jav
# download VAEs
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./checkpoints/Wan2.1-T2V-1.3B
hf download cvssp/audioldm2 --local-dir ./checkpoints/audioldm2The following command will generate a standard 480P sounding video for 5 seconds at 16 FPS:
python scripts/inference.py \
configs/javisdit-v1-0/inference/sample.py \
--num-frames 81 --resolution 480p --aspect-ratio 9:16 \
--model-path ./checkpoints/JavisDiT-v1.0-jav \
--prompt "A brown bear is walking towards the camera, growling in a natural setting with greenery in the background." \
--verbose 2--verbose 2 will display the progress of a single diffusion.
If you want to generate 240P, 4 second videos on a given prompt list (organized with a .txt for .csv file):
python scripts/inference.py \
configs/javisdit-v1-0/inference/sample.py \
--num-frames 65 --resolution 240p --aspect-ratio 9:16 \
--model-path ./checkpoints/JavisDiT-v1.0-jav \
--prompt-path data/meta/JavisBench.csv --verbose 1--verbose 1 will display the progress of the whole generation list.
To enable multi-device inference, you need to use torchrun to run the inference script. The following command will run the inference with 2 GPUs.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 scripts/inference.py \
configs/javisdit-v1-0/inference/sample.py \
--num-frames 65 --resolution 240p --aspect-ratio 9:16 \
--model-path ./checkpoints/JavisDiT-v1.0-jav \
--prompt-path data/meta/JavisBench.csv --verbose 1We follow OpenSora to use a .csv file to manage all the training entries and their attributes for efficient training:
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | yyy |
The content of columns may vary in different training stages. The detailed instructions for each training stage can be found in assets/docs/data.md.
This stage performs audio pretraining to intialize the text-to-audio generation. Following the instructions to prepare the audio data, or just download the preprocessed data from HuggingFace:
hf download --repo-type dataset JavisVerse/JavisData-Audio --local-dir /path/to/audioThen, run the command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --standalone --nproc_per_node 8 \
scripts/train.py \
configs/javisdit-v1-0/train/stage1_audio.py \
--data-path data/meta/audio/train_audio.csvThe resulting checkpoints will be saved at runs/0aa-Wan2_1_T2V_1_3B/epoch0bb-global_stepccc/model, and you can move the weights to outputs/stage1_audio_pt/model.
This stage finetunes the T2V+T2A backbone for preliminary T2AV generation.
Following the instructions to prepare the audio-video data, and the video_id list used in JavisDiT++ is provided at assets/meta/AV_SFT_330K_video_ids.txt, as we cannot release the raw YouTube videos used by TAVGBench due to copyright issues.
Then, run the command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --standalone --nproc_per_node 8 \
scripts/train.py \
configs/javisdit-v1-0/train/stage2_audio_video.py \
--data-path data/meta/video/train_av_sft.csvThe resulting checkpoints will be saved at runs/0xx-Wan2_1_T2V_1_3B/epoch0yy-global_stepzzz/*, and you can move the weights to outputs/stage2_av_sft/*.
This stage deploy DPO to improve human-preference alignment of T2AV generation. Following the instructions to prepare the audio-video preference data. We are working on releasing our generated data.
Then, run the command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --standalone --nproc_per_node 8 \
scripts/train.py \
configs/javisdit-v1-0/train/stage3_audio_video_dpo.py \
--data-path data/meta/avdpo/train_av_dpo.csvThe resulting checkpoints will be saved at runs/0aa-Wan2_1_T2V_1_3B/epoch0bb-global_stepccc/model.
mv runs/0aa-Wan2_1_T2V_1_3B/epoch0bb-global_stepccc ./outputs/JavisDiT-v1.0-javInstall necessary packages:
pip install -r requirements/requirements-eval.txtDownload the meta file and data of JavisBench, and put them into data/eval/:
mkdir -p data/eval
hf download --repo-type dataset JavisVerse/JavisBench --local-dir data/eval/JavisBenchRun the joint audio-video generation (JAVG) inference to generate sounding videos in 240P for 4 seconds:
DATASET="JavisBench" # or JavisBench-mini
prompt_path="data/eval/JavisBench/${DATASET}.csv"
cfg_file="configs/javisdit-v1-0/inference/sample.py"
model_path="./checkpoints/JavisDiT-v1.0-jav"
save_dir="samples/${DATASET}"
resolution=240p
num_frames=4s
aspect_ratio="9:16"
rm -rf ${save_dir}
python scripts/inference.py ${cfg_file} \
--resolution ${resolution} --num-frames ${num_frames} --aspect-ratio ${aspect_ratio} \
--prompt-path ${prompt_path} --model-path ${model_path} \
--save-dir ${save_dir} --verbose 1
# (Optional, for evaluation) Extract audios from generated videos
python -m tools.datasets.convert video ${save_dir} --output ${save_dir}/meta.csv
python -m tools.datasets.datautil ${save_dir}/meta.csv --extract-audio --audio-sr 16000
rm -f ${save_dir}/meta*.csvRun the following code and the results will be saved in ./evaluation_results.
MAX_FRAMES=16
IMAGE_SIZE=224
MAX_AUDIO_LEN_S=4.0
# Params to calculate JavisScore
WINDOW_SIZE_S=2.0
WINDOW_OVERLAP_S=1.5
METRICS="all"
RESULTS_DIR="./evaluation_results"
DATASET="JavisBench" # or JavisBench-mini
INPUT_FILE="data/eval/JavisBench/${DATASET}.csv"
FVD_AVCACHE_PATH="data/eval/JavisBench/cache/fvd_fad/${DATASET}-vanilla-max4s.pt"
INFER_DATA_DIR="samples/${DATASET}"
python -m eval.javisbench.main \
--input_file "${INPUT_FILE}" \
--infer_data_dir "${INFER_DATA_DIR}" \
--output_file "${RESULTS_DIR}/${DATASET}.json" \
--max_frames ${MAX_FRAMES} \
--image_size ${IMAGE_SIZE} \
--max_audio_len_s ${MAX_AUDIO_LEN_S} \
--window_size_s ${WINDOW_SIZE_S} \
--window_overlap_s ${WINDOW_OVERLAP_S} \
--fvd_avcache_path ${FVD_AVCACHE_PATH} \
--metrics ${METRICS}Below we show our appreciation for the exceptional work and generous contribution to open source. Special thanks go to the authors of Open-Sora, Wan-Video, AudioLDM2, and TAVGBench for their valuable codebase and dataset. For other works and datasets, please refer to our paper.
If you find JavisDiT is useful and use it in your project, please kindly cite:
@inproceedings{liu2025javisdit,
title = {JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization},
author = {Liu, Kai and Li, Wei and Chen, Lai and Wu, Shengqiong and Zheng, Yanhao and Ji, Jiayi and Zhou, Fan and Luo, Jiebo and Liu, Ziwei and Fei, Hao and Chua, Tat-Seng},
conference = {The Fourteenth International Conference on Learning Representations},
year = {2026},
}
@inproceedings{liu2026javisdit++,
title = {JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation},
author = {Liu, Kai and Zheng, Yanhao and Wang, Kai and Wu, Shengqiong and Zhang, Rongjunchen and Luo, Jiebo and Hatzinakos, Dimitrios and Liu, Ziwei and Fei, Hao and Chua, Tat-Seng},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
}