VideoScore2 is trained on the VideoFeedback2 dataset containing 27K human-annotated videos with both scores and rationales across three dimensions. We adopt a two-stage pipeline: first, supervised fine-tuning (SFT) on Qwen2.5-VL-7B-Instruct to establish format-following and scoring ability; then, reinforcement learning with Group Relative Policy Optimization (GRPO) to further align model outputs with human judgment and enhance analytical robustness.
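For intuition, the sketch below illustrates the group-relative advantage computation at the heart of GRPO. It is a minimal illustration, not the project's training code; the reward values and the reward definition in the comments are placeholder assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage (not VideoScore2's actual
# training code): sample a group of responses per prompt, score each with a
# reward, and normalize the rewards within the group.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of per-response rewards to zero mean, unit std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one prompt. Here the rewards are assumed to
# reflect agreement with human-annotated scores (an assumption for this sketch).
rewards = [0.9, 0.4, 0.7, 0.2]
advantages = group_relative_advantages(rewards)
print(advantages)  # responses above the group mean receive positive advantage
```

Responses that score above their group's mean are reinforced and those below are discouraged, which is what pushes the model's scores toward human judgment during the RL stage.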
Compared to VideoScore (v1), VS2 introduces interpretable scoring for three dimensions (Visual Quality, Text Alignment, Physical/Common-sense Consistency) and CoT-style rationales, achieving stronger generalization on out-of-domain benchmarks while providing transparent and human-aligned video evaluation.
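As an illustration of what "interpretable scoring" means in practice, a per-video evaluation record pairing a score with a rationale for each dimension might look like the following. The field names and the 1-5 scale are assumptions for this sketch, not VS2's documented output schema.

```python
# Hypothetical evaluation record; field names and the 1-5 scale are assumptions,
# not the model's actual output format.
example_evaluation = {
    "visual_quality": {
        "score": 4,
        "rationale": "Frames are sharp overall, with minor flicker in the background.",
    },
    "text_alignment": {
        "score": 3,
        "rationale": "The prompt's red car appears, but the requested rain is missing.",
    },
    "physical_consistency": {
        "score": 2,
        "rationale": "The car passes through a fence, violating object solidity.",
    },
}
```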