🔥🔥🔥 ICML 2025 Spotlight | MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
The language-centric pretraining of existing large-scale multimodal models often induces modality bias, which makes fine-grained emotional cues hard to capture. To address this, the Kuaishou Keling team, in collaboration with Nankai University, studied multimodal emotion understanding and identified key limitations of current multimodal models in detecting emotional signals. Focusing on the multimodal attention mechanism, the team introduced a modular duplex attention paradigm and built on it a multimodal model, MODA, which integrates perception, cognition, and emotion understanding. MODA shows substantial improvements on 21 benchmarks spanning six task categories: general QA, knowledge QA, table and OCR, visual-centric, cognitive analysis, and emotion understanding. Its attention mechanism also helps MODA perform well in human-computer interaction scenarios such as character profiling and planning deduction. The work was accepted to ICML 2025 and selected as a Spotlight paper (top 2.6%). ✨
🔥🔥🔥 CVPR 2024 | MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation
When labels are extremely scarce, a low-cost, annotation-free source of large-scale supervision is needed. To address this, a masked emotion modeling method is proposed that leverages linguistic emotional cues in videos to reconstruct the temporal distribution of emotions, thereby learning discriminative representations. ✨
🔥🔥🔥 CVPR 2023 | Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
A cross-modal temporal erasing network for video emotion analysis that, in a weakly supervised manner, locates not only keyframes but also context and audio-related information. ✨
🔥🔥🔥 ACM MM 2022 | Temporal Sentiment Localization: Listen and Look in Untrimmed Videos
Because densely annotated datasets are costly to label, we propose TSL-Net, which uses single-frame supervision to localize sentiment in videos. ✨
🔥🔥🔥 AAAI 2020 | An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
🔥🔥🔥 TAC 2024 | Looking into Gait for Perceiving Emotions via Bilateral Posture and Movement Graph Convolutional Networks



