We test VideoScore2 on the test set of VideoFeedback2, which contains 500 videos with human scores across three dimensions.
Below we show the results of scoring/reward models for image or video, such as VideoReward, VisionReward, Q-Insight, and DeQA-Score; MLLM-prompting methods such as Claude-Sonnet-4, GPT-5, and Gemini-2.5-Pro; and our VideoScore2.
<span style="vertical-align: middle">Evaluation on Out-of-Domain Benchmarks</span>
</h2>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<p>
We further test on four out-of-domain (OOD) benchmarks: two pairwise preference benchmarks (VideoGenReward-Bench and the preference version of T2VQA-DB) and two point-score benchmarks (MJ-Bench-Video and Video-Phy2-test).
Preference benchmark results include ties. As shown in the table below, while VideoScore2 is
not always the top model on each benchmark, it achieves the highest overall average.