arXiv 2026
A training-time alignment framework for improving reference fidelity, identity consistency, and text controllability in reference-to-video generation.
Reference-to-video (R2V) generation often suffers from two practical issues:
- copy-paste artifacts
- multi-subject confusion
RefAlign addresses these issues by explicitly aligning DiT reference-branch features to the feature space of a frozen visual foundation model (VFM) during training.
- Better reference fidelity
- Improved identity consistency
- Stronger semantic discrimination
- No inference-time overhead
- State-of-the-art results on OpenS2V-Eval
In short: RefAlign improves reference-consistent video generation through explicit representation alignment.
- RefAlign achieves state-of-the-art performance on OpenS2V-Eval.
- We release both 1.3B and 14B checkpoints.
- The method is applied only during training, so the alignment module and VFM are discarded at inference time.
- RefAlign outperforms strong open-source baselines and matches or surpasses several closed-source systems.
| Reference Images | Output Video |
|---|---|
| ![]() | faceobj_1.mp4 |
| ![]() | faceobj_2.mp4 |
| ![]() | ref3.mp4 |
| ![]() | ref4.1.mp4 |
RefAlign achieves SOTA performance on OpenS2V-Eval across multiple metrics.
| Model | Type | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
| 🥈 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
Motivation of RefAlign. (a) R2V generation suffers from copy-paste artifacts and multi-subject confusion. (b) t-SNE visualization shows that DiT reference features are highly entangled, while DINOv3 features are more separable. RefAlign aligns DiT features to the DINOv3 feature space to improve reference separability. (c) Visual comparison with and without RefAlign.
Reference-to-video (R2V) generation is a controllable video synthesis setting where both text prompts and reference images are used to guide video generation. It is useful for applications such as personalized advertising, virtual try-on, and identity-consistent video creation.
Most existing R2V methods introduce additional high-level semantic or cross-modal features on top of the VAE latent representation and jointly feed them into a diffusion Transformer (DiT). Although these auxiliary features can provide useful semantic guidance, they often remain insufficient to resolve the modality mismatch across heterogeneous encoders, which leads to issues such as copy-paste artifacts and multi-subject confusion.
To address this, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a frozen visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls same-subject pairs closer and pushes different-subject pairs farther apart, improving both identity consistency and semantic discriminability.
RefAlign is used only during training, which means the alignment process and the VFM are discarded at inference time, introducing no extra inference overhead. Extensive experiments on OpenS2V-Eval demonstrate that RefAlign achieves state-of-the-art TotalScore, validating the effectiveness of explicit representation alignment for R2V generation.
(a) Overview of RefAlign. During training, the proposed reference alignment loss is applied to intermediate features in selected DiT blocks and aligns them to target features extracted by a frozen visual foundation model. During inference, both the alignment process and the VFM are removed. (b) Illustration of the reference alignment loss, which pulls matched pairs together and pushes mismatched pairs apart.
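The pull/push behavior of the reference alignment loss can be sketched as a symmetric InfoNCE-style contrastive objective over matched DiT/VFM feature pairs. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the function name, the temperature value, and the plain softmax cross-entropy over the matched-pair diagonal are expository choices.

```python
import numpy as np

def reference_alignment_loss(dit_feats, vfm_feats, tau=0.07):
    """Contrastive alignment sketch: pull matched (same-subject) DiT/VFM
    feature pairs together, push mismatched pairs apart (InfoNCE-style)."""
    # L2-normalize both feature sets so similarity is cosine similarity
    d = dit_feats / np.linalg.norm(dit_feats, axis=1, keepdims=True)
    v = vfm_feats / np.linalg.norm(vfm_feats, axis=1, keepdims=True)
    logits = d @ v.T / tau                       # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax; the diagonal holds the matched (positive) pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

During training this term would be added to the diffusion objective for the selected DiT blocks; because it only shapes the features, nothing in the sketch is needed at inference time.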
```shell
git clone https://github.com/gudaochangsheng/RefAlign.git
cd RefAlign
conda create -n refalign python=3.8 -y
conda activate refalign
pip install -r requirements.txt
```

```shell
# Inference with RefAlign-1.3B
python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

# Inference with RefAlign-14B
python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
```

| Model | Params | Hugging Face | ModelScope |
|---|---|---|---|
| RefAlign-1.3B | 1.3B | | |
| RefAlign-14B | 14B | | |
```shell
# Train RefAlign-1.3B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage1.sh

# Train RefAlign-1.3B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-1.3B_stage2.sh

# Train RefAlign-14B: stage 1 (OpenS2V)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage1.sh

# Train RefAlign-14B: stage 2 (Phantom-Data)
sh ./examples/wanvideo/model_training/full/Wan2.1-T2V-14B_stage2.sh
```

- Release paper
- Release project page
- Release 1.3B checkpoint
- Release 14B checkpoint
- Release inference code
- Release training scripts
- Add more demos
- Add more ablation results
- RefAlign is a training-time alignment strategy.
- The alignment loss improves reference-aware generation without adding inference-time cost.
- Both 1.3B and 14B models are provided for practical use and comparison.
If you find RefAlign useful, please consider giving this repository a star ⭐ and citing our paper.
```bibtex
@article{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Wang, Lei and Song, Yuxin and Wu, Ge and Feng, Haocheng and Zhou, Hang and Wang, Jingdong and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2603.25743},
  year={2026}
}
```

This project is based on DiffSynth-Studio.
We sincerely acknowledge the inspiring prior work:
Phantom,
VINO,
OpenS2V,
Phantom-Data,
and Wan2.1.
If you have any questions, please feel free to contact:
scitop1998@gmail.com