MBZUAI, University of California Merced, Google Research, Australian National University, Linköping University
Code, Dataset and Model will be released soon. Stay tuned! 🚀
Video-CoM introduces a new paradigm for interactive video reasoning, enabling models to think with videos instead of merely thinking about them. Rather than relying on a single static video encoding, Video-CoM performs iterative visual actions (segment finding, frame selection, and spatial zooming) to actively gather evidence through a Chain of Manipulations (CoM).
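As a rough picture of what these three manipulations could look like as callable tools, here is a minimal sketch; the function names and signatures below are illustrative assumptions, not the released API:

```python
import numpy as np

# Hypothetical manipulation tools; names and signatures are illustrative only.

def find_segment(video_frames: np.ndarray, start_s: float, end_s: float, fps: float) -> np.ndarray:
    """Re-watch a temporal window by slicing out frames between start_s and end_s."""
    lo, hi = int(start_s * fps), int(end_s * fps)
    return video_frames[lo:hi]

def find_frame(video_frames: np.ndarray, timestamp_s: float, fps: float) -> np.ndarray:
    """Pause on a single key frame at the requested timestamp."""
    return video_frames[min(int(timestamp_s * fps), len(video_frames) - 1)]

def spatial_zoom(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Zoom into a spatial region given an (x1, y1, x2, y2) pixel box."""
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]
```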

Video-CoM reasons with videos through a coherent chain of manipulations, actively gathering and integrating visual evidence throughout the reasoning process.
- Interactive Video Reasoning Framework: Moves beyond passive video encoding by enabling the model to actively rewatch specific moments, pause on key frames, and zoom into fine details throughout its reasoning trajectory, gathering evidence step by step rather than relying on a single static video representation.
- Chain of Manipulations (CoM): A structured, interpretable reasoning mechanism in which each step retrieves new visual evidence before textual reasoning continues.
- Video-CoM-Instruct (18K), a Manipulation-Driven Dataset: Carefully curated videos with dense annotations designed specifically for active visual reasoning.
- Reasoning-Aware GRPO (RA-GRPO): Unlike accuracy-only RL, RA-GRPO provides step-level reasoning rewards, guiding consistent and visually grounded reasoning.
- Strong Performance: Strong results across five reasoning benchmarks and two generic video-understanding benchmarks, along with significant gains on our manipulation-focused benchmark, demonstrating the effectiveness of interactive visual reasoning.
The Video-CoM-Instruct dataset is constructed through three key stages:
- Curating information-dense videos suited for fine-grained reasoning
- Generating manipulation-targeted QA pairs that require segment revisiting, frame inspection, and spatial zooming
- Adding dense temporal and spatial annotations to enable step-level reinforcement learning
Building on this foundation, each example follows a structured reasoning format that alternates between exploratory reasoning, where the model infers which moment or region likely contains the needed evidence; visual manipulation, where it executes targeted actions such as find-segment, find-frame, or spatial-zoom to retrieve new visual input; and observation, where it interprets the newly revealed evidence and integrates it into the next step.
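One way to picture a single example under this format is as an alternating trace of reasoning, manipulation, and observation steps. The sketch below is a hypothetical illustration; the field names, values, and exact schema are assumptions rather than the released data format:

```python
# Hypothetical Chain-of-Manipulations trace; keys and contents are illustrative only.
example = {
    "question": "What brand is printed on the cup the chef picks up?",
    "chain": [
        {"type": "reasoning",    "text": "The cup appears when the chef plates the dish, likely near the end."},
        {"type": "manipulation", "action": "find-segment", "args": {"start_s": 42.0, "end_s": 48.0}},
        {"type": "observation",  "text": "The chef reaches for a cup on the counter around 45 s."},
        {"type": "manipulation", "action": "find-frame",   "args": {"timestamp_s": 45.2}},
        {"type": "observation",  "text": "The cup is visible but the label is too small to read."},
        {"type": "manipulation", "action": "spatial-zoom", "args": {"box": [310, 120, 470, 260]}},
        {"type": "observation",  "text": "The zoomed crop reveals the printed brand name."},
    ],
    "answer": "<final answer grounded in the zoomed evidence>",
}
```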
Most existing video reasoning models rely solely on final-answer rewards, offering no guidance on whether intermediate reasoning steps are visually grounded or correct. To address this, we introduce reasoning-aware rewards enabled by our dense temporal and spatial annotations, allowing the model to receive feedback at every manipulation step. Reasoning-Aware GRPO (RA-GRPO) enhances interactive video reasoning by providing step-level rewards that evaluate the correctness of predicted manipulations.
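As a rough sketch of how step-level manipulation rewards could be blended with a final-answer reward, consider the following; the matching rules, decay constants, and weighting are assumptions for illustration, not the exact formulation used by RA-GRPO:

```python
# Hedged sketch of a reasoning-aware reward; matching rules and weights are assumptions.

def temporal_iou(pred, gt):
    """IoU between two (start_s, end_s) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def step_reward(pred_step, gt_step):
    """Score one predicted manipulation against its dense annotation."""
    if pred_step["action"] != gt_step["action"]:
        return 0.0
    if pred_step["action"] == "find-segment":
        return temporal_iou(pred_step["args"]["segment"], gt_step["args"]["segment"])
    if pred_step["action"] == "find-frame":
        # Reward decays with distance (in seconds) from the annotated key frame.
        return max(0.0, 1.0 - abs(pred_step["args"]["timestamp_s"] - gt_step["args"]["timestamp_s"]) / 5.0)
    if pred_step["action"] == "spatial-zoom":
        return box_iou(pred_step["args"]["box"], gt_step["args"]["box"])
    return 0.0

def ra_grpo_reward(pred_steps, gt_steps, answer_correct, alpha=0.5):
    """Blend the final-answer reward with the mean step-level manipulation reward."""
    steps = [step_reward(p, g) for p, g in zip(pred_steps, gt_steps)]
    step_term = sum(steps) / len(steps) if steps else 0.0
    return (1.0 - alpha) * float(answer_correct) + alpha * step_term
```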
Video-CoM maintains dynamic visual attention throughout its reasoning process, re-engaging with frames and regions whenever new evidence is needed. Unlike prior models that tend to drift toward text tokens and rely on world knowledge, Video-CoM’s attention consistently anchors to vision tokens at each manipulation step, whether locating a segment, isolating a frame, or zooming into fine details.
@article{rasheed2025videocom,
  title={Video-CoM: Interactive Video Reasoning via Chain of Manipulations},
  author={Rasheed, Hanoona and Zumri, Mohammed and Maaz, Muhammad and Yang, Ming-Hsuan and Khan, Fahad S. and Khan, Salman},
  journal={arXiv preprint arXiv:2511.23477},
  year={2025}
}





