This repository explores a range of architectures for Multimodal Emotion Recognition (MER), emphasizing the integration of multiple modalities (text, audio, video) to improve sentiment analysis. Each architecture offers unique strengths and trade-offs in accuracy, efficiency, and resilience to challenges such as misaligned or missing modalities. Across these experiments, Transformer-based methods consistently demonstrated the highest effectiveness for multimodal sentiment analysis (MSA).
The MOSI (Multimodal Opinion Sentiment and Intensity) dataset is a widely used benchmark in multimodal sentiment and emotion analysis. It consists of short video clips where speakers express their opinions and emotions, combining three modalities: text (transcriptions of spoken words), audio (vocal tone and pitch), and visual (facial expressions). Each segment is annotated with sentiment intensity scores ranging from -3 (strongly negative) to +3 (strongly positive).
MOSI is commonly used for tasks like multimodal fusion, sentiment prediction, and emotion recognition, making it a crucial dataset for advancing research in human-computer interaction and affective computing.
The MOSEI (Multimodal Opinion Sentiment and Emotion Intensity) dataset is an extension of MOSI, designed to be larger and more diverse. It contains over 23,000 video clips from more than 1,000 speakers, covering a wide range of topics. Each clip is annotated with sentiment (ranging from -3 to +3, like MOSI) and with emotion intensity across six primary emotions: happiness, sadness, anger, fear, surprise, and disgust.
MOSEI is also multimodal, combining text, audio, and visual data, making it suitable for tasks like sentiment analysis, emotion recognition, and multimodal fusion. Its scale and diversity make it a key resource for advancing multimodal natural language processing and understanding real-world affective expressions.
Early Fusion:
- Combines features from all modalities right after feature extraction.
- Uses Gated Recurrent Units (GRUs) and Transformers as encoders over the fused sequence (see the sketch below).
- Achieved moderate to good performance, with Transformers outperforming GRUs.
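A minimal PyTorch sketch of the early-fusion idea, assuming time-aligned, pre-extracted features per modality; the feature dimensions and hidden sizes here are illustrative, not the repository's actual configuration:

```python
import torch
import torch.nn as nn

class EarlyFusionGRU(nn.Module):
    """Concatenate modality features at each time step, then encode the fused sequence."""
    def __init__(self, dims=(300, 74, 35), hidden=128, num_out=1):
        super().__init__()
        self.gru = nn.GRU(sum(dims), hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_out)

    def forward(self, text, audio, video):
        # Each input: (batch, seq_len, feat_dim); modalities are assumed time-aligned.
        fused = torch.cat([text, audio, video], dim=-1)  # fuse right after feature extraction
        _, h = self.gru(fused)                           # h: (num_layers, batch, hidden)
        return self.head(h[-1])                          # sentiment score / logits

model = EarlyFusionGRU()
out = model(torch.randn(2, 50, 300), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
print(out.shape)  # torch.Size([2, 1])
```

Swapping the GRU for a Transformer encoder over the same concatenated features gives the Early Fusion (Transformer) variant.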
Late Fusion:
- Processes each modality independently until the decision stage, where outputs are combined.
- Uses an architecture similar to Early Fusion, but the delayed integration led to slightly better performance for some models (see the sketch below).
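A corresponding late-fusion sketch (again with illustrative dimensions): each modality is encoded separately, and only the resulting summary vectors are combined at the decision stage.

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Independent per-modality encoders; outputs are merged only at the decision stage."""
    def __init__(self, dims=(300, 74, 35), hidden=64, num_out=1):
        super().__init__()
        self.encoders = nn.ModuleList([nn.GRU(d, hidden, batch_first=True) for d in dims])
        self.head = nn.Linear(hidden * len(dims), num_out)

    def forward(self, *modalities):
        # modalities: (batch, seq_len_m, dim_m) tensors; sequence lengths may differ per modality.
        summaries = []
        for enc, x in zip(self.encoders, modalities):
            _, h = enc(x)             # final hidden state summarizes the modality
            summaries.append(h[-1])
        return self.head(torch.cat(summaries, dim=-1))  # decision-level combination
```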
Tensor Fusion:
- Fuses modalities through the outer product of their embeddings, capturing intra- and inter-modal interactions (see the sketch below).
- Achieved roughly the same performance as the Early and Late Fusion techniques.
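A sketch of the outer-product fusion step in the spirit of the Tensor Fusion Network; appending a constant 1 to each embedding keeps the unimodal and bimodal interaction terms inside the fused tensor (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Three-way outer product of per-modality embeddings, then a linear prediction head."""
    def __init__(self, dims=(32, 16, 16), num_out=1):
        super().__init__()
        fused_dim = (dims[0] + 1) * (dims[1] + 1) * (dims[2] + 1)
        self.head = nn.Linear(fused_dim, num_out)

    def forward(self, zt, za, zv):
        # zt, za, zv: (batch, dim) summary vectors from per-modality sub-networks.
        add_one = lambda z: torch.cat([z, torch.ones(z.size(0), 1, device=z.device)], dim=-1)
        zt, za, zv = add_one(zt), add_one(za), add_one(zv)
        fused = torch.einsum('bi,bj,bk->bijk', zt, za, zv)  # full interaction tensor
        return self.head(fused.flatten(start_dim=1))
```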
Low-Rank Tensor Fusion:
- A more efficient variant of Tensor Fusion that factorizes the fusion tensor into low-rank, modality-specific factors (see the sketch below).
- Much more parameter-efficient than Tensor Fusion while achieving comparable accuracy.
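A sketch of the low-rank variant: instead of materializing the full fused tensor, each modality is projected `rank` times into the output space, and the projections are multiplied elementwise and summed over the rank dimension (illustrative dimensions; the published LMF formulation differs in details such as initialization):

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank factorization of the fusion step: per-modality rank-wise projections,
    elementwise product across modalities, then a sum over the rank dimension."""
    def __init__(self, dims=(32, 16, 16), rank=4, out_dim=32, num_out=1):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims])
        self.head = nn.Linear(out_dim, num_out)

    def forward(self, *zs):
        # zs: per-modality summary vectors, each (batch, dim_m).
        fused = 1.0
        for z, w in zip(zs, self.factors):
            z1 = torch.cat([z, torch.ones(z.size(0), 1, device=z.device)], dim=-1)
            fused = fused * torch.einsum('bd,rdo->bro', z1, w)  # (batch, rank, out_dim)
        return self.head(fused.sum(dim=1))                      # sum over rank
```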
Multimodal Factorization Model:
- Separates representations into shared multimodal factors and modality-specific generative factors.
- Incorporates modality-specific decoders to reconstruct the inputs (see the sketch below).
- Suffered from overfitting, producing a large gap between training and test accuracy.
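A highly simplified sketch of the factorization idea, assuming per-modality summary vectors as input: a shared factor feeds the prediction head, while per-modality private factors (together with the shared factor) drive the reconstruction decoders. The actual model is generative and trained with additional losses; this only shows the structural split:

```python
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    """Split each modality into a shared multimodal factor and a modality-specific factor."""
    def __init__(self, dims=(300, 74, 35), shared=32, private=16, num_out=1):
        super().__init__()
        self.shared_encs = nn.ModuleList([nn.Linear(d, shared) for d in dims])
        self.private_encs = nn.ModuleList([nn.Linear(d, private) for d in dims])
        self.decoders = nn.ModuleList([nn.Linear(shared + private, d) for d in dims])
        self.head = nn.Linear(shared, num_out)

    def forward(self, *xs):
        # xs: per-modality summary vectors (batch, dim_m).
        shared = torch.stack([e(x) for e, x in zip(self.shared_encs, xs)]).mean(0)
        privates = [e(x) for e, x in zip(self.private_encs, xs)]
        recons = [dec(torch.cat([shared, p], dim=-1))           # modality-specific decoders
                  for dec, p in zip(self.decoders, privates)]
        return self.head(shared), recons
```

Training would combine a task loss on the prediction with reconstruction losses on each element of `recons`.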
Multimodal Cyclic Translation Network:
- Uses cyclic translation between modalities to create robust joint representations.
- Captures shared and complementary information across modalities effectively.
- The most parameter-efficient model tested (see the sketch below).
- Achieved acceptable accuracy on MOSI but performed poorly on MOSEI.
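A sketch of one cyclic translation pair (e.g. text -> audio -> text), assuming time-aligned feature sequences; the encoder state of the forward translation serves as the joint representation, and the back-translation provides the cycle-consistency target:

```python
import torch
import torch.nn as nn

class CyclicTranslation(nn.Module):
    """Translate a source modality into a target modality and back; predict sentiment
    from the forward encoder's state, which acts as the joint representation."""
    def __init__(self, src_dim=300, tgt_dim=74, hidden=32, num_out=1):
        super().__init__()
        self.enc = nn.GRU(src_dim, hidden, batch_first=True)
        self.to_tgt = nn.Linear(hidden, tgt_dim)     # forward translation (per time step)
        self.back_enc = nn.GRU(tgt_dim, hidden, batch_first=True)
        self.to_src = nn.Linear(hidden, src_dim)     # cyclic (back) translation
        self.head = nn.Linear(hidden, num_out)

    def forward(self, src):
        # src: (batch, seq_len, src_dim), e.g. the text modality.
        h_seq, h_last = self.enc(src)
        tgt_hat = self.to_tgt(h_seq)                 # predicted target-modality sequence
        b_seq, _ = self.back_enc(tgt_hat)
        src_hat = self.to_src(b_seq)                 # reconstructed source for the cycle loss
        pred = self.head(h_last[-1])                 # sentiment from the joint representation
        return pred, tgt_hat, src_hat
```

Training would combine a prediction loss on `pred` with translation losses on `tgt_hat` and `src_hat`.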
Multimodal Transformer (MulT):
- Utilizes a crossmodal attention mechanism to dynamically fuse information across time steps (see the sketch below).
- Handles misalignments between modalities efficiently.
- Demonstrated good performance among the architectures tested.
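A sketch of a single directed crossmodal attention block in the spirit of MulT (e.g. text attending to audio); because attention spans all time steps, the two sequences do not have to be aligned or of equal length (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossmodalBlock(nn.Module):
    """The target modality queries the source modality via multi-head attention."""
    def __init__(self, dim=40, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, target, source):
        # target: (batch, len_t, dim), source: (batch, len_s, dim); lengths may differ.
        attended, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + attended)            # residual + norm
        return self.norm2(x + self.ff(x))            # position-wise feed-forward

block = CrossmodalBlock()
out = block(torch.randn(2, 50, 40), torch.randn(2, 120, 40))
print(out.shape)  # torch.Size([2, 50, 40])
```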
Results of each architecture on the two benchmarks:

| Architecture | CMU-MOSI | CMU-MOSEI |
|---|---|---|
| Early Fusion (Transformer) | 75.65 | 71.91 |
| Late Fusion (GRU) | 75.21 | 71.60 |
| Multimodal Transformer | 75.21 | 70.40 |
| Late Fusion (Transformer) | 73.32 | 68.49 |
| Multimodal Cyclic Translation Network | 72.44 | 59.49 |
| Tensor Fusion | 72.30 | 70.45 |
| Low Rank Tensor Fusion | 72.01 | 70.95 |
| Unimodal | 71.28 | 70.01 |
| Early Fusion (GRU) | 66.90 | 49.01 |
| Multimodal Factorization | 63.70 | 56.88 |

Approximate parameter counts:

| Architecture | Parameters (millions) |
|---|---|
| Early Fusion (Transformer) | ~8.1 |
| Late Fusion (GRU) | ~2.5 |
| Multimodal Transformer | ~3.0 |
| Late Fusion (Transformer) | ~20 |
| Multimodal Cyclic Translation Network | ~0.2 |
| Tensor Fusion | ~5.4 |
| Low Rank Tensor Fusion | ~1.5 |
| Unimodal | ~1.9 |
| Early Fusion (GRU) | ~1.6 |
| Multimodal Factorization | ~1.4 |
All experiments were carried out on Google Colab Pro, using an A100 or T4 GPU with high RAM.