Hi authors,
Thank you for your excellent work on MovieChat. It's a very clever and effective approach for long video understanding. I'm studying your method and have a few technical questions about the feature processing pipeline. I would be grateful for your clarification.
1. Generality of the Memory Mechanism
Conceptually, your memory consolidation process seems highly adaptable. Could it also be applied to other MLLM architectures, especially those that use a simpler projector (e.g., a single MLP layer) instead of a Q-Former?
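To make this concrete, here is a minimal sketch of what I have in mind, assuming pooled per-frame CLIP features and a LLaVA-style MLP connector; every function name and dimension below is my own placeholder, not taken from your codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def consolidate(frame_feats: torch.Tensor, target_len: int) -> torch.Tensor:
    """Greedily average the most similar adjacent frame features until only
    target_len remain (my reading of the consolidation idea)."""
    feats = list(frame_feats)                      # T tensors of shape (D,)
    while len(feats) > target_len:
        sims = torch.stack([
            F.cosine_similarity(feats[i], feats[i + 1], dim=0)
            for i in range(len(feats) - 1)
        ])
        i = int(torch.argmax(sims))                # most redundant adjacent pair
        feats[i] = (feats[i] + feats[i + 1]) / 2   # merge by averaging
        del feats[i + 1]
    return torch.stack(feats)


# Hypothetical LLaVA-style connector: a plain MLP instead of a Q-Former.
D_VIT, D_LLM = 1024, 4096
mlp_projector = nn.Sequential(
    nn.Linear(D_VIT, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM)
)

clip_frame_feats = torch.randn(256, D_VIT)         # e.g. 256 pooled frame features
visual_tokens = mlp_projector(consolidate(clip_frame_feats, target_len=64))
```

Since the consolidation itself only seems to depend on similarities between frame features, it looks encoder- and projector-agnostic to me, which is what prompted the question.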
2. Q-Former's Handling of "Synthetic" Features
Your method merges features from several frames (from EVA-CLIP) into a "synthetic" feature before feeding it to the BLIP-2 Q-Former. Since the Q-Former was pre-trained on original, non-merged features from a vision encoder, how does it generalize to correctly understand these "mixed" features created by your merging process?
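To make sure I am reading that step correctly, here is how I currently picture it; the shapes, the 8-frame window, and the `qformer` call in the comment are my guesses, not your actual API:

```python
import torch

T, P, D = 8, 257, 1408            # frames being merged, ViT patch tokens, EVA-CLIP width
vit_feats = torch.randn(T, P, D)  # per-frame EVA-CLIP ViT outputs

# The "synthetic" feature: an element-wise average of the frames being merged,
# i.e. a convex combination of real ViT outputs rather than an arbitrary new vector.
synthetic = vit_feats.mean(dim=0, keepdim=True)      # (1, P, D)

# My understanding is that this tensor is then handed to the frozen BLIP-2
# Q-Former exactly as a single frame's ViT output would be, e.g. something like
#   query_out = qformer(query_tokens, encoder_hidden_states=synthetic)
# and my question is how the Q-Former copes with this shifted feature distribution.
```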
3. The Training-Free Projector/Connection to the LLM
My main question is about the final connection to the LLM. In standard MLLMs, the projector is a critical component that is explicitly trained to map features from a specific vision encoder into the LLM's semantic space.
Usually, if the input visual features are altered, one would need to retrain the projector (and possibly fine-tune the LLM) to match the new feature distribution. Your framework impressively bypasses this training step entirely.
Could you elaborate on what makes this possible? After the features go through the complex chain (EVA-CLIP ViT -> Merging -> BLIP-2 Q-Former), what is the key insight that allows Vicuna to understand them without a dedicated training phase for the connection? Is it that the BLIP-2 Q-Former's output is a "universal" representation that is already well-aligned with the input space of most LLMs?
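For reference, this is the chain as I currently picture it, with stand-in modules marking each stage; none of the names or dimensions below come from your implementation:

```python
import torch
import torch.nn as nn

# Placeholder dimensions: EVA-CLIP width, Q-Former output width, Vicuna embedding width.
D_VIT, D_Q, D_LLM = 1408, 768, 4096
T, P, NUM_QUERY = 512, 257, 32


class StubQFormer(nn.Module):
    """Stand-in that maps (N, P, D_VIT) visual features to (N, NUM_QUERY, D_Q) query outputs."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VIT, D_Q)

    def forward(self, visual_feats):
        return self.proj(visual_feats[:, :NUM_QUERY, :])


qformer = StubQFormer()
llm_proj = nn.Linear(D_Q, D_LLM)   # the connection into Vicuna that my question is about

# 1. Frozen EVA-CLIP ViT features for T sampled frames.
vit_feats = torch.randn(T, P, D_VIT)

# 2. Memory consolidation into 64 "synthetic" features
#    (uniform windows here, as a stand-in for the similarity-based merge).
merged = vit_feats.view(64, T // 64, P, D_VIT).mean(dim=1)

# 3. Frozen BLIP-2 Q-Former compresses each synthetic feature into query tokens.
q_out = qformer(merged)                      # (64, NUM_QUERY, D_Q)

# 4. Projection into Vicuna's input embedding space -- the step whose
#    training-free alignment I would like to understand.
visual_tokens = llm_proj(q_out)              # (64, NUM_QUERY, D_LLM)
```

In particular, I would like to understand whether step 4 corresponds to a projection that was already trained in a prior work and is simply reused frozen, or whether it truly requires no trained parameters at all.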
Thank you for your time and for sharing this fantastic research.