[NeurIPS 2025] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization


📖 Overview

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations.

  • Stage 1 (Anchored Cues): Synthesizes preference triples by anchoring questions to individual short clips interleaved with distractors. This stage mitigates positional bias and activates long-context retrieval capabilities using only short-video data.

  • Stage 2 (Self-Reasoning): The model aligns its preferences through multi-segment reasoning tasks, handling longer and more complex dependencies in real long videos.

With only 16K synthetic examples, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks (e.g., MLVU, LongVideoBench) while maintaining strong short-video performance.
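As a rough illustration of the preference objective, the sketch below computes a generic DPO loss on a single toy preference pair. This is not the repository's training code: the triple fields, log-probability values, and beta are all hypothetical.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO loss on one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A synthetic preference triple as Stage 1 might produce (field names hypothetical):
triple = {
    "question": "What happens in the anchored clip?",
    "chosen": "an answer grounded in the anchor clip",
    "rejected": "an answer that confuses the anchor with a distractor clip",
}

# Toy log-probabilities: the policy prefers the chosen answer more strongly
# than the frozen reference does, so the loss falls below log(2).
loss = dpo_loss(pi_chosen=-1.0, pi_rejected=-3.0, ref_chosen=-2.0, ref_rejected=-2.5)
```

When policy and reference agree exactly, the margin is zero and the loss sits at log(2); training pushes it lower by widening the chosen-over-rejected gap relative to the reference.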

🌟 Key Features

  • Sequence Parallelism: Implements DeepSpeed Ulysses sequence parallelism on both the Vision Encoder (ViT) and the LLM, efficiently distributing ultra-long video sequences (up to 128K context length) across GPUs during training.
  • Annotation-Free: Extends context length capabilities using only synthetic data derived from short-video datasets (e.g., LLaVA-Video, Vript), eliminating the need for expensive human long-video labeling.
  • Two-Stage DPO: A progressive training strategy that first grounds visual cues (Stage 1) and then aligns complex cross-scene reasoning (Stage 2).
  • Strong Performance: Achieves superior performance on MLVU, LongVideoBench, LVBench, and Video-MME with progressive DPO alignment.
| Benchmark           | Type  | InternVL3-8B | LongVPO (Ours) |
|---------------------|-------|--------------|----------------|
| MLVU                | Long  | 71.4         | 76.4           |
| LongVideoBench      | Long  | 62.3         | 66.0           |
| LVBench             | Long  | 48.8         | 53.6           |
| Video-MME (w/o sub) | Long  | 66.5         | 68.9           |
| Video-MME (w/ sub)  | Long  | 72.5         | 74.0           |
| MVBench             | Short | 75.4         | 75.0           |
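The Ulysses-style sequence parallelism listed among the features can be pictured as plain sequence sharding: each rank holds one contiguous slice of an ultra-long token sequence. This is a schematic only; DeepSpeed Ulysses additionally exchanges shards via all-to-all communication around attention, which this sketch omits.

```python
def shard_sequence(tokens, world_size):
    """Split a token sequence into near-equal contiguous chunks, one per rank.
    Earlier ranks absorb the remainder, so shard lengths differ by at most one."""
    base, rem = divmod(len(tokens), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

# 10 video tokens sharded across 4 ranks -> chunk sizes 3, 3, 2, 2
shards = shard_sequence(list(range(10)), 4)
```

Concatenating the shards in rank order recovers the original sequence, which is the invariant the all-to-all exchange in a real Ulysses setup relies on.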

🛠️ Environment Setup

  1. Clone the repository:
git clone https://github.com/MCG-NJU/LongVPO.git
cd LongVPO
  2. Install dependencies:
conda create --name longvpo python=3.10
conda activate longvpo
pip install -r requirements.txt
# Install Flash Attention for efficiency
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

📊 Evaluation

Evaluate trained models on comprehensive video understanding benchmarks including MLVU, LongVideoBench, LVBench, Video-MME, and MVBench.

⚡ Long-video-friendly evaluation I/O optimization: We use customized evaluation scripts with optimized PyTorch dataloaders that reduce long-video loading latency by roughly 50% through parallel processing.
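The repository's optimized dataloaders are not reproduced here, but the parallel-loading idea can be sketched with a thread pool that decodes frames concurrently. `load_frame` is a stand-in for a real decode call, not an actual function from this codebase.

```python
from concurrent.futures import ThreadPoolExecutor

def load_frame(index):
    """Stand-in for decoding one frame; a real loader would call a video
    backend (e.g., decord) here. Returns a dummy (index, payload) pair."""
    return (index, f"frame-{index}")

def load_frames_parallel(indices, num_workers=8):
    """Decode frames concurrently; map() preserves input order, so the result
    matches a sequential loop while overlapping I/O-bound decode calls."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(load_frame, indices))

frames = load_frames_parallel(range(4))
```

Because frame decoding is largely I/O-bound, threads overlap the waits even under the GIL; that overlap is where the latency reduction comes from.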

bash shell/eval_bench/evaluate.sh

Custom evaluation command:

# Evaluate on a specific benchmark (e.g., Video-MME)
torchrun --nproc_per_node=8 \
    eval_bench/evaluate_videomme.py \
    --datasets videomme \
    --data_path /path/to/Video-MME \
    --checkpoint MCG-NJU/LongVPO-Stage2-InternVL3-8B \
    --num_segments 512 \
    --fps 1 \
    --out-dir output/

🚂 Training

📂 Data Preparation

Frame Extraction: Before training, extract frames from your video datasets (e.g., LLaVA-Video-178K) for efficient I/O during training.

# shell/data_pipeline/extract_frames.sh
python data_pipeline/extract_frames.py \
    --config anno/stage1/stage1.json \
    --output_dir data/frames \
    --fps 1 \
    --num_workers 8 \
    --backend decord

Parameters:

  • --config: Path to the annotation JSON file containing video paths.
  • --output_dir: Directory to save extracted frames.
  • --fps: Frames per second to extract (default: 1).
  • --num_workers: Number of parallel workers.
  • --backend: Video decoding backend (e.g., decord).
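The --fps parameter implies keeping roughly one decoded frame per interval of source video. A minimal index-selection sketch of that idea follows; the actual script's logic may differ.

```python
def sample_frame_indices(total_frames, video_fps, target_fps=1.0):
    """Pick frame indices so roughly `target_fps` frames survive per second
    of video: step through the timeline in 1/target_fps-second increments."""
    step = video_fps / target_fps  # source frames per sampled frame
    indices, t = [], 0.0
    while round(t) < total_frames:
        indices.append(round(t))
        t += step
    return indices

# A 10-second clip at 30 fps sampled at 1 fps -> 10 indices, 30 frames apart
idx = sample_frame_indices(total_frames=300, video_fps=30, target_fps=1.0)
```

Stepping in floating-point time and rounding (rather than using an integer stride) keeps the sampling accurate when the source fps is not an exact multiple of the target fps.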

📂 Training Scripts

LongVPO involves a two-stage training process. Please ensure you have configured the dataset paths in the scripts.

# Stage 1: Efficient Short-to-Long Learning from Anchored Cues
# Trains on synthetic interleaved short clips to mitigate position bias.
bash shell/train/internvl3/stage1.sh

# Stage 2: Self-Training for Long Video Preference Alignment
# Aligns the model using self-generated reasoning chains on real long videos.
bash shell/train/internvl3/stage2.sh

📜 Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{huang2025longvpo,
  title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
  author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=LKAp7Dknxf}
}
