arXiv 2026
A training-time alignment framework for improving reference fidelity, identity consistency, and text controllability in reference-to-video generation.
Reference-to-video (R2V) generation often suffers from two practical issues:
- copy-paste artifacts
- multi-subject confusion
RefAlign addresses these issues by explicitly aligning DiT reference-branch features to the feature space of a frozen visual foundation model (VFM) during training.
- Better reference fidelity
- Improved identity consistency
- Stronger semantic discrimination
- No inference-time overhead
- State-of-the-art results on OpenS2V-Eval
In short: RefAlign improves reference-consistent video generation through explicit representation alignment.
- RefAlign achieves state-of-the-art performance on OpenS2V-Eval.
- We release both 1.3B and 14B checkpoints.
- The method is applied only during training, so the alignment module and VFM are discarded at inference time.
- RefAlign outperforms strong open-source baselines and matches or surpasses several closed-source systems.
| Reference Images | Output Video |
|---|---|
| ![]() | faceobj_1.mp4 |
| ![]() | faceobj_2.mp4 |
| ![]() | ref3.mp4 |
| ![]() | ref4.1.mp4 |
RefAlign achieves SOTA performance on OpenS2V-Eval across multiple metrics.
| Model | Type | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
| 🥈 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
Motivation of RefAlign. (a) R2V generation suffers from copy-paste artifacts and multi-subject confusion. (b) t-SNE visualization shows that DiT reference features are highly entangled, while DINOv3 features are more separable. RefAlign aligns DiT features to the DINOv3 feature space to improve reference separability. (c) Visual comparison with and without RefAlign.
Reference-to-video (R2V) generation is a controllable video synthesis setting where both text prompts and reference images are used to guide video generation. It is useful for applications such as personalized advertising, virtual try-on, and identity-consistent video creation.
Most existing R2V methods introduce additional high-level semantic or cross-modal features on top of the VAE latent representation and jointly feed them into a diffusion Transformer (DiT). Although these auxiliary features can provide useful semantic guidance, they often remain insufficient to resolve the modality mismatch across heterogeneous encoders, which leads to issues such as copy-paste artifacts and multi-subject confusion.
To address this, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a frozen visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls same-subject pairs closer and pushes different-subject pairs farther apart, improving both identity consistency and semantic discriminability.
RefAlign is used only during training, which means the alignment process and the VFM are discarded at inference time, introducing no extra inference overhead. Extensive experiments on OpenS2V-Eval demonstrate that RefAlign achieves state-of-the-art TotalScore, validating the effectiveness of explicit representation alignment for R2V generation.
(a) Overview of RefAlign. During training, the proposed reference alignment loss is applied to intermediate features in selected DiT blocks and aligns them to target features extracted by a frozen visual foundation model. During inference, both the alignment process and the VFM are removed. (b) Illustration of the reference alignment loss, which pulls matched pairs together and pushes mismatched pairs apart.
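The pull/push behavior of the reference alignment loss can be sketched as a symmetric InfoNCE-style contrastive objective over matched DiT/VFM feature pairs. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the function name, the temperature value, and the plain softmax cross-entropy over the matched-pair diagonal are expository choices.

```python
import numpy as np

def reference_alignment_loss(dit_feats, vfm_feats, tau=0.07):
    """Contrastive alignment sketch: pull matched (same-subject) DiT/VFM
    feature pairs together, push mismatched pairs apart (InfoNCE-style)."""
    # L2-normalize both feature sets so similarity is cosine similarity
    d = dit_feats / np.linalg.norm(dit_feats, axis=1, keepdims=True)
    v = vfm_feats / np.linalg.norm(vfm_feats, axis=1, keepdims=True)
    logits = d @ v.T / tau                       # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax; the diagonal holds the matched (positive) pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

During training this term would be added to the diffusion objective for the selected DiT blocks; because it only shapes the features, nothing in the sketch is needed at inference time.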
```shell
git clone https://github.com/gudaochangsheng/RefAlign.git
cd RefAlign
conda create -n refalign python=3.8 -y
conda activate refalign
pip install -r requirements.txt
```

```shell
# Inference with RefAlign-1.3B
python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

# Inference with RefAlign-14B
python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
```

| Model | Params | Hugging Face | ModelScope |
|---|---|---|---|
| RefAlign-1.3B | 1.3B | | |
| RefAlign-14B | 14B | | |
```shell
# Train RefAlign-1.3B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage1.sh

# Train RefAlign-1.3B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage2.sh

# Train RefAlign-14B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage1.sh

# Train RefAlign-14B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage2.sh
```

- Release paper
- Release project page
- Release 1.3B checkpoint
- Release 14B checkpoint
- Release inference code
- Release training scripts
- Add more demos
- Add more ablation results
- RefAlign is a training-time alignment strategy.
- The alignment loss improves reference-aware generation without adding inference-time cost.
- Both 1.3B and 14B models are provided for practical use and comparison.
If you find RefAlign useful, please consider giving this repository a star ⭐ and citing our paper.
```bibtex
@article{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Wang, Lei and Song, Yuxin and Wu, Ge and Feng, Haocheng and Zhou, Hang and Wang, Jingdong and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2603.25743},
  year={2026}
}
```

This project is based on DiffSynth-Studio.
We sincerely acknowledge the inspiring prior work:
Phantom,
VINO,
OpenS2V,
Phantom-Data,
and Wan2.1.
If you have any questions, please feel free to contact:
scitop1998@gmail.com