Official repo for the ICML 2025 paper "RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer"
Access our paper via [arXiv](https://arxiv.org/abs/2506.11465).
Authors: Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu.
In this work, we extend imbalanced multimodal learning to dynamic fusion paradigms. We identify the deactivation of the dynamic property of the attention mechanism and propose a simple yet effective method, RollingQ, to revive the cooperation dynamics in multimodal Transformers.
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this by adaptively emphasizing modalities based on the characteristics of the input data. However, through a series of carefully designed experiments, we surprisingly observe that the dynamic adaptability of widely used self-attention models diminishes: the model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ, and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers.
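For intuition, below is a minimal sketch of the query-rotation idea, assuming an audio-video fusion Transformer with a shared query attending over keys from both modalities. The helper names (`rotation_between`, `rolling_query`), the tensor shapes, and the choice of rotation target (the midpoint of the two modalities' key centroids) are illustrative assumptions, not the repository's actual implementation; please refer to the code in this repo and the paper for the exact procedure.

```python
import torch
import torch.nn.functional as F

def rotation_between(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Orthogonal matrix that rotates unit vector `a` onto unit vector `b`
    via two Householder reflections (hypothetical helper, not from the repo)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    u = F.normalize(a + b, dim=-1)  # bisector of a and b
    eye = torch.eye(a.shape[-1], device=a.device, dtype=a.dtype)
    # (I - 2 u u^T)(I - 2 a a^T) maps a to b and has determinant +1.
    return (eye - 2 * torch.outer(u, u)) @ (eye - 2 * torch.outer(a, a))

def rolling_query(query, keys_audio, keys_video):
    """Rotate the fusion query toward the midpoint of the two modalities'
    key centroids so neither modality is systematically over-attended.
    Shapes assumed: (batch, tokens, dim) for all inputs."""
    k_a = keys_audio.mean(dim=(0, 1))                      # audio key centroid
    k_v = keys_video.mean(dim=(0, 1))                      # video key centroid
    target = F.normalize(k_a + k_v, dim=-1)                # balanced key direction
    source = F.normalize(query.mean(dim=(0, 1)), dim=-1)   # current query direction
    rot = rotation_between(source, target)
    return query @ rot.T                                   # rotate every query token
```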
To set up the environment according to requirements.txt, please run the following commands in your shell:
conda create -n rollingQ python=3.10
conda activate rollingQ
pip install -r requirements.txt
We conduct our experiments on Kinetics-Sounds, CREMA-D, and CMU-MOSEI. You can download the original datasets and follow the preprocessing instructions provided in BalanceBench. Alternatively, the preprocessed data is available on Hugging Face.
If you find this work useful, please consider citing it:
@article{ni2025rollingq,
title={RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer},
author={Ni, Haotian and Wei, Yake and Liu, Hang and Chen, Gong and Peng, Chong and Lin, Hao and Hu, Di},
journal={arXiv preprint arXiv:2506.11465},
year={2025}
}
This work is supported by the National Natural Science Foundation of China (No. 62106272). This work is also supported by the Public Computing Cloud, Renmin University of China, and the fund for building world-class universities (disciplines) of Renmin University of China.
If you have any further questions or suggestions, feel free to email us at: [email protected]