RefAlign logo

🚀 RefAlign: Representation Alignment for Reference-to-Video Generation

arXiv 2026
A training-time alignment framework for improving reference fidelity, identity consistency, and text controllability in reference-to-video generation.



Lei Wang1,2,*,‡, Yuxin Song2,‡, Ge Wu1, Haocheng Feng2, Hang Zhou2, Jingdong Wang2, Yaxing Wang4†, Jian Yang1,3†
1 PCA Lab, VCIP, College of Computer Science, Nankai University    2 Baidu Inc.    3 PCA Lab, School of Intelligence Science and Technology, Nanjing University    4 College of Artificial Intelligence, Jilin University
† Corresponding authors    * Interns at Baidu Inc.    ‡ Equal contribution

🔥 Why RefAlign?

Reference-to-video (R2V) generation often suffers from two practical issues:

  • copy-paste artifacts
  • multi-subject confusion

RefAlign addresses these issues by explicitly aligning DiT reference-branch features to the feature space of a frozen visual foundation model (VFM) during training.

Key advantages

  • Better reference fidelity
  • Improved identity consistency
  • Stronger semantic discrimination
  • No inference-time overhead
  • State-of-the-art results on OpenS2V-Eval

In short: RefAlign improves reference-consistent video generation through explicit representation alignment.


๐Ÿ† Highlights

  • RefAlign achieves state-of-the-art performance on OpenS2V-Eval.
  • We release both 1.3B and 14B checkpoints.
  • The method is applied only during training, so the alignment module and VFM are discarded at inference time.
  • RefAlign outperforms strong open-source baselines and is competitive with, and in several cases better than, closed-source systems.

🖼️ Overview

RefAlign abstract

🎬 Demo

Demo videos pairing reference images with generated output:

  • faceobj_1.mp4
  • faceobj_2.mp4
  • ref3.mp4
  • ref4.1.mp4

๐Ÿ† OpenS2V-Eval Leaderboard

RefAlign achieves SOTA performance on OpenS2V-Eval across multiple metrics.

| Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
| 🥇 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |

💡 Motivation

RefAlign motivation
Motivation of RefAlign. (a) R2V generation suffers from copy-paste artifacts and multi-subject confusion. (b) t-SNE visualization shows that DiT reference features are highly entangled, while DINOv3 features are more separable. RefAlign aligns DiT features to the DINOv3 feature space to improve reference separability. (c) Visual comparison with and without RefAlign.

📘 Introduction

Reference-to-video (R2V) generation is a controllable video synthesis setting where both text prompts and reference images are used to guide video generation. It is useful for applications such as personalized advertising, virtual try-on, and identity-consistent video creation.

Most existing R2V methods introduce additional high-level semantic or cross-modal features on top of the VAE latent representation and jointly feed them into a diffusion Transformer (DiT). Although these auxiliary features can provide useful semantic guidance, they often remain insufficient to resolve the modality mismatch across heterogeneous encoders, which leads to issues such as copy-paste artifacts and multi-subject confusion.

To address this, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a frozen visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls same-subject pairs closer and pushes different-subject pairs farther apart, improving both identity consistency and semantic discriminability.

RefAlign is used only during training, which means the alignment process and the VFM are discarded at inference time, introducing no extra inference overhead. Extensive experiments on OpenS2V-Eval demonstrate that RefAlign achieves state-of-the-art TotalScore, validating the effectiveness of explicit representation alignment for R2V generation.


🧠 Method

RefAlign method
(a) Overview of RefAlign. During training, the proposed reference alignment loss is applied to intermediate features in selected DiT blocks and aligns them to target features extracted by a frozen visual foundation model. During inference, both the alignment process and the VFM are removed. (b) Illustration of the reference alignment loss, which pulls matched pairs together and pushes mismatched pairs apart.
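As a rough illustration of the idea (not the paper's exact formulation), the reference alignment loss can be sketched as an InfoNCE-style contrastive objective: each DiT reference-branch feature is pulled toward the frozen VFM feature of the same subject and pushed away from the features of other subjects. The function and tensor names below are hypothetical, and a real setup would project DiT features to the VFM dimension first.

```python
import torch
import torch.nn.functional as F

def reference_alignment_loss(dit_feats: torch.Tensor,
                             vfm_feats: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of a contrastive reference alignment loss.

    dit_feats: (N, D) projected DiT reference-branch features, one per subject
    vfm_feats: (N, D) frozen VFM (e.g. DINOv3) features for the same subjects
    Row i of both tensors corresponds to the same subject (the positive pair);
    every other row serves as a negative.
    """
    z = F.normalize(dit_feats, dim=-1)   # cosine-normalize DiT features
    t = F.normalize(vfm_feats, dim=-1)   # cosine-normalize VFM targets
    logits = z @ t.T / temperature       # (N, N) pairwise similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

In a full training loop this term would be added to the diffusion objective only at selected DiT blocks, with a small trainable projection head matching feature dimensions; both the head and the VFM are discarded at inference, consistent with the training-only design described above.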

✨ Qualitative Results

Qualitative comparison with existing methods.
RefAlign qualitative results

📈 Quantitative Results

RefAlign quantitative results


⚡ Quick Start

Installation

git clone https://github.com/gudaochangsheng/RefAlign.git
cd RefAlign

conda create -n refalign python=3.8 -y
conda activate refalign

pip install -r requirements.txt

Inference

# Inference with RefAlign-1.3B
python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

# Inference with RefAlign-14B
python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py

📦 Model Zoo

| Model | Params | Hugging Face | ModelScope |
|---|---|---|---|
| RefAlign-1.3B | 1.3B | HF Download | MS Download |
| RefAlign-14B | 14B | HF Download | MS Download |

๐Ÿ‹๏ธ Training

# Train RefAlign-1.3B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage1.sh

# Train RefAlign-1.3B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage2.sh

# Train RefAlign-14B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage1.sh

# Train RefAlign-14B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage2.sh

✅ Updates

  • Release paper
  • Release project page
  • Release 1.3B checkpoint
  • Release 14B checkpoint
  • Release inference code
  • Release training scripts
  • Add more demos
  • Add more ablation results

📌 Notes

  • RefAlign is a training-time alignment strategy.
  • The alignment loss improves reference-aware generation without adding inference-time cost.
  • Both 1.3B and 14B models are provided for practical use and comparison.

📚 Citation

If you find RefAlign useful, please consider giving this repository a star ⭐ and citing our paper.

@article{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Wang, Lei and Song, Yuxin and Wu, Ge and Feng, Haocheng and Zhou, Hang and Wang, Jingdong and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2603.25743},
  year={2026}
}

🙏 Acknowledgements

This project is based on DiffSynth-Studio.
We sincerely acknowledge the inspiring prior work: Phantom, VINO, OpenS2V, Phantom-Data, and Wan2.1.


📮 Contact

If you have any questions, please feel free to contact:

scitop1998@gmail.com