Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng,
Lingxue Song, Xi Chen, Liang Li, and Limin Wang
We present LongVPO, a novel two-stage Direct Preference Optimization (DPO) framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations.
- Stage 1 (Anchored Cues): Synthesizes preference triples by anchoring questions to individual short clips interleaved with distractors. This stage mitigates positional bias and activates long-context retrieval capabilities using only short-video data.
- Stage 2 (Self-Reasoning): Aligns the model's preferences through multi-segment reasoning tasks, handling longer and more complex dependencies in real long videos.
With only 16K synthetic examples, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks (e.g., MLVU, LongVideoBench) while maintaining strong short-video performance.
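The Stage-1 idea described above can be sketched in a few lines. This is a minimal, illustrative mock-up (not the repository's actual pipeline; all names are hypothetical): a QA-anchored short clip is inserted at a random position among distractor clips, producing one (prompt, chosen, rejected) preference triple from short-video data alone.

```python
import random

def build_anchored_triple(anchor_clip, distractor_clips, qa_pair, seed=0):
    """Sketch of Stage-1 preference synthesis: place a short QA-anchored clip
    at a random position among distractors, so answering requires retrieving
    the right location in a long interleaved sequence."""
    rng = random.Random(seed)
    clips = list(distractor_clips)
    pos = rng.randint(0, len(clips))  # random anchor position mitigates positional bias
    clips.insert(pos, anchor_clip)
    question, answer = qa_pair
    return {
        "video_clips": clips,          # interleaved pseudo-long video
        "prompt": question,
        "chosen": answer,              # answer grounded in the anchor clip
        "rejected": "The video does not contain this information.",  # illustrative negative
        "anchor_index": pos,
    }
```

Randomizing the anchor position is the key design choice: it prevents the model from learning to look only at the start or end of the sequence.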
- Sequence Parallelism: Implements DeepSpeed Ulysses sequence parallelism on both the Vision Encoder (ViT) and the LLM, efficiently distributing ultra-long video sequences (up to 128K context length) across GPUs during training.
- Annotation-Free: Extends context length capabilities using only synthetic data derived from short-video datasets (e.g., LLaVA-Video, Vript), eliminating the need for expensive human long-video labeling.
- Two-Stage DPO: A progressive training strategy that first grounds visual cues (Stage 1) and then aligns complex cross-scene reasoning (Stage 2).
- Strong Performance: Achieves superior performance on MLVU, LongVideoBench, LVBench, and Video-MME with progressive DPO alignment.
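To give intuition for the sequence-parallelism feature above, here is a toy sketch of the first step of Ulysses-style parallelism: evenly sharding one ultra-long token sequence across ranks. (The real DeepSpeed Ulysses additionally exchanges shards via all-to-all over attention heads; that communication step is not shown, and this function is illustrative, not the repo's implementation.)

```python
def shard_sequence(tokens, world_size):
    """Sketch of sequence-parallel sharding: split one ultra-long sequence
    evenly across GPUs (ranks), so a 128K-token context fits in aggregate
    device memory. Each rank holds only its contiguous chunk."""
    assert len(tokens) % world_size == 0, "pad the sequence to a multiple of world_size"
    chunk = len(tokens) // world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(world_size)]
```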
| Benchmark | Type | InternVL3-8B | LongVPO (Ours) |
|---|---|---|---|
| MLVU | Long | 71.4 | 76.4 |
| LongVideoBench | Long | 62.3 | 66.0 |
| LVBench | Long | 48.8 | 53.6 |
| Video-MME (w/o sub) | Long | 66.5 | 68.9 |
| Video-MME (w/ sub) | Long | 72.5 | 74.0 |
| MVBench | Short | 75.4 | 75.0 |
- Clone the repository:
git clone https://github.com/MCG-NJU/LongVPO.git
cd LongVPO
- Install dependencies:
conda create --name longvpo python=3.10
conda activate longvpo
pip install -r requirements.txt
# Install Flash Attention for efficiency
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
Evaluate trained models on comprehensive video understanding benchmarks, including MLVU, LongVideoBench, LVBench, Video-MME, and MVBench.
⚡ Long-video-friendly evaluation I/O optimization: we use customized evaluation scripts with optimized PyTorch dataloaders, reducing long-video loading latency by approximately 50% through parallel processing.
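The parallel-loading idea behind that optimization can be sketched as follows. This is a simplified illustration (the actual scripts live in `eval_bench/`; `read_fn` stands in for whatever per-frame reader is used): reading hundreds of frame files concurrently hides per-file disk latency.

```python
from concurrent.futures import ThreadPoolExecutor

def load_frames_parallel(frame_paths, read_fn, num_workers=8):
    """Sketch of parallel frame I/O: map a reader function over many frame
    paths with a thread pool instead of a sequential loop. pool.map
    preserves the input order, so frames come back in temporal order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(read_fn, frame_paths))
```

Threads (rather than processes) fit here because frame reading is I/O-bound, so the GIL is released while waiting on disk.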
bash shell/eval_bench/evaluate.sh
Custom evaluation command:
# Evaluate on a specific benchmark (e.g., Video-MME)
torchrun --nproc_per_node=8 \
eval_bench/evaluate_videomme.py \
--datasets videomme \
--data_path /path/to/Video-MME \
--checkpoint MCG-NJU/LongVPO-Stage2-InternVL3-8B \
--num_segments 512 \
--fps 1 \
--out-dir output/
Frame Extraction: Before training, extract frames from your video datasets (e.g., LLaVA-Video-178K) for efficient I/O during training.
# shell/data_pipeline/extract_frames.sh
python data_pipeline/extract_frames.py \
--config anno/stage1/stage1.json \
--output_dir data/frames \
--fps 1 \
--num_workers 8 \
--backend decord
Parameters:
- `--config`: Path to the annotation JSON file containing video paths.
- `--output_dir`: Directory to save extracted frames.
- `--fps`: Frames per second to extract (default: 1).
- `--num_workers`: Number of parallel workers.
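For reference, fps-based sampling as used by the `--fps` flag amounts to picking one frame index per `1/fps` seconds of video. A minimal sketch (assumed behavior, not the script's exact code):

```python
def frame_indices(total_frames, native_fps, target_fps=1):
    """Sketch of fps-based frame sampling: choose frame indices spaced
    native_fps / target_fps apart, i.e. one frame per 1/target_fps seconds."""
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(total_frames / step))]
```

For a 10-second clip at 30 fps sampled at 1 fps, this yields ten indices spaced 30 frames apart.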
LongVPO involves a two-stage training process. Please ensure you have configured the dataset paths in the scripts.
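Both stages optimize a DPO-style objective over the synthesized preference triples. As background, a minimal scalar sketch of the standard DPO loss (Rafailov et al., 2023), written per-example for clarity rather than as the repo's batched implementation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Minimal DPO objective sketch: maximize the margin by which the policy
    prefers the chosen response over the rejected one, measured relative to
    a frozen reference model. Returns -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases as the policy favors the chosen response more strongly than the reference does.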
# Stage 1: Efficient Short-to-Long Learning from Anchored Cues
# Trains on synthetic interleaved short clips to mitigate position bias.
bash shell/train/internvl3/stage1.sh
# Stage 2: Self-Training for Long Video Preference Alignment
# Aligns the model using self-generated reasoning chains on real long videos.
bash shell/train/internvl3/stage2.sh
If you find this work useful for your research, please consider citing our paper:
@inproceedings{huang2025longvpo,
title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=LKAp7Dknxf}
}