Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng,
Lingxue Song, Xi Chen, Liang Li, and Limin Wang
We present LongVPO, a novel two-stage Direct Preference Optimization (DPO) framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations.
- Stage 1 (Anchored Cues): Synthesizes preference triples by anchoring questions to individual short clips interleaved with distractors. This stage mitigates positional bias and activates long-context retrieval capabilities using only short-video data.
- Stage 2 (Self-Reasoning): Aligns the model's preferences through multi-segment reasoning tasks, handling longer and more complex dependencies in real long videos.
With only 16K synthetic examples, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks (e.g., MLVU, LongVideoBench) while maintaining strong short-video performance.
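The Stage-1 idea described above can be sketched in a few lines. This is a minimal, illustrative mock-up (not the repository's actual pipeline; all names are hypothetical): a QA-anchored short clip is inserted at a random position among distractor clips, producing one (prompt, chosen, rejected) preference triple from short-video data alone.

```python
import random

def build_anchored_triple(anchor_clip, distractor_clips, qa_pair, seed=0):
    """Sketch of Stage-1 preference synthesis: place a short QA-anchored clip
    at a random position among distractors, so answering requires retrieving
    the right location in a long interleaved sequence."""
    rng = random.Random(seed)
    clips = list(distractor_clips)
    pos = rng.randint(0, len(clips))  # random anchor position mitigates positional bias
    clips.insert(pos, anchor_clip)
    question, answer = qa_pair
    return {
        "video_clips": clips,          # interleaved pseudo-long video
        "prompt": question,
        "chosen": answer,              # answer grounded in the anchor clip
        "rejected": "The video does not contain this information.",  # illustrative negative
        "anchor_index": pos,
    }
```

Randomizing the anchor position is the key design choice: it prevents the model from learning to look only at the start or end of the sequence.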
- Sequence Parallelism: Implements DeepSpeed Ulysses sequence parallelism on both the Vision Encoder (ViT) and the LLM, efficiently distributing ultra-long video sequences (up to 128K context length) across GPUs during training.
- Annotation-Free: Extends context length capabilities using only synthetic data derived from short-video datasets (e.g., LLaVA-Video, Vript), eliminating the need for expensive human long-video labeling.
- Two-Stage DPO: A progressive training strategy that first grounds visual cues (Stage 1) and then aligns complex cross-scene reasoning (Stage 2).
- Strong Performance: Achieves superior performance on MLVU, LongVideoBench, LVBench, and Video-MME with progressive DPO alignment.
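To give intuition for the sequence-parallelism feature above, here is a toy sketch of the first step of Ulysses-style parallelism: evenly sharding one ultra-long token sequence across ranks. (The real DeepSpeed Ulysses additionally exchanges shards via all-to-all over attention heads; that communication step is not shown, and this function is illustrative, not the repo's implementation.)

```python
def shard_sequence(tokens, world_size):
    """Sketch of sequence-parallel sharding: split one ultra-long sequence
    evenly across GPUs (ranks), so a 128K-token context fits in aggregate
    device memory. Each rank holds only its contiguous chunk."""
    assert len(tokens) % world_size == 0, "pad the sequence to a multiple of world_size"
    chunk = len(tokens) // world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(world_size)]
```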
| Benchmark | Type | InternVL3-8B | LongVPO (Ours) |
|---|---|---|---|
| MLVU | Long | 71.4 | 76.4 |
| LongVideoBench | Long | 62.3 | 66.0 |
| LVBench | Long | 48.8 | 53.6 |
| Video-MME (w/o sub) | Long | 66.5 | 68.9 |
| Video-MME (w/ sub) | Long | 72.5 | 74.0 |
| MVBench | Short | 75.4 | 75.0 |
- Clone the repository:
git clone https://github.com/MCG-NJU/LongVPO.git
cd LongVPO
- Install dependencies:
conda create --name longvpo python=3.10
conda activate longvpo
pip install -r requirements.txt
# Install Flash Attention for efficiency
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
Evaluate trained models on comprehensive video understanding benchmarks, including MLVU, LongVideoBench, LVBench, Video-MME, and MVBench.
⚡ Long-video-friendly evaluation I/O optimization: we use customized evaluation scripts with optimized PyTorch dataloaders, reducing long-video loading latency by approximately 50% through parallel processing.
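The parallel-loading idea behind that optimization can be sketched as follows. This is a simplified illustration (the actual scripts live in `eval_bench/`; `read_fn` stands in for whatever per-frame reader is used): reading hundreds of frame files concurrently hides per-file disk latency.

```python
from concurrent.futures import ThreadPoolExecutor

def load_frames_parallel(frame_paths, read_fn, num_workers=8):
    """Sketch of parallel frame I/O: map a reader function over many frame
    paths with a thread pool instead of a sequential loop. pool.map
    preserves the input order, so frames come back in temporal order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(read_fn, frame_paths))
```

Threads (rather than processes) fit here because frame reading is I/O-bound, so the GIL is released while waiting on disk.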
bash shell/eval_bench/evaluate.sh
Custom evaluation command:
# Evaluate on a specific benchmark (e.g., Video-MME)
torchrun --nproc_per_node=8 \
eval_bench/evaluate_videomme.py \
--datasets videomme \
--data_path /path/to/Video-MME \
--checkpoint MCG-NJU/LongVPO-Stage2-InternVL3-8B \
--num_segments 512 \
--fps 1 \
--out-dir output/
Frame Extraction: Before training, extract frames from your video datasets (e.g., LLaVA-Video-178K) for efficient I/O during training.
# shell/data_pipeline/extract_frames.sh
python data_pipeline/extract_frames.py \
--config anno/stage1/stage1.json \
--output_dir data/frames \
--fps 1 \
--num_workers 8 \
--backend decord
Parameters:
- `--config`: Path to the annotation JSON file containing video paths.
- `--output_dir`: Directory to save extracted frames.
- `--fps`: Frames per second to extract (default: 1).
- `--num_workers`: Number of parallel workers.
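For reference, fps-based sampling as used by the `--fps` flag amounts to picking one frame index per `1/fps` seconds of video. A minimal sketch (assumed behavior, not the script's exact code):

```python
def frame_indices(total_frames, native_fps, target_fps=1):
    """Sketch of fps-based frame sampling: choose frame indices spaced
    native_fps / target_fps apart, i.e. one frame per 1/target_fps seconds."""
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(total_frames / step))]
```

For a 10-second clip at 30 fps sampled at 1 fps, this yields ten indices spaced 30 frames apart.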
LongVPO involves a two-stage training process. Please ensure you have configured the dataset paths in the scripts.
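Both stages optimize a DPO-style objective over the synthesized preference triples. As background, a minimal scalar sketch of the standard DPO loss (Rafailov et al., 2023), written per-example for clarity rather than as the repo's batched implementation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Minimal DPO objective sketch: maximize the margin by which the policy
    prefers the chosen response over the rejected one, measured relative to
    a frozen reference model. Returns -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases as the policy favors the chosen response more strongly than the reference does.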
# Stage 1: Efficient Short-to-Long Learning from Anchored Cues
# Trains on synthetic interleaved short clips to mitigate position bias.
bash shell/train/internvl3/stage1.sh
# Stage 2: Self-Training for Long Video Preference Alignment
# Aligns the model using self-generated reasoning chains on real long videos.
bash shell/train/internvl3/stage2.sh
If you find this work useful for your research, please consider citing our paper:
@inproceedings{huang2025longvpo,
title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=LKAp7Dknxf}
}