Official Implementation of our Knowledge-Based Systems (Elsevier) 2026 Paper
An audio-video based multi-modal fusion approach for emotion recognition
📄 Paper • 💻 Code • 🧠 PyTorch • 🎯 Multimodal Learning
Authors: S M Jishanul Islam, Sahid Hossain Mustakim, Musfirat Hossain, Mysun Mashira, Nur Islam Shourav, Md Rayhan Ahmed, Salekul Islam, A.K.M Muzahidul Islam, Swakkhar Shatabda
Keywords: Multimodal Emotion Recognition, Speech Emotion Recognition, Audio-Visual Learning, Deep Learning, R(2+1)D, Squeeze-and-Excitation, CNN, PyTorch, Affective Computing, Human Behavior Analysis
- 🚀 Accepted at Knowledge-Based Systems (Elsevier), 2026 (In Press)
- 🎧 + 🎥 Audio-Visual Multimodal Fusion Framework
- 🧠 Combines CNN, SE Networks, and R(2+1)D
- 📊 Achieves state-of-the-art performance on:
- RAVDESS
- SAVEE
- CREMA-D
- ⚡ Pretrained models available via `torch.hub`
Emotion recognition from speech is inherently multimodal, involving both acoustic signals and facial expressions.
This repository presents a deep multimodal architecture that:
- Extracts spatiotemporal features from video using R(2+1)D
- Processes acoustic features using 1D CNN + Squeeze-and-Excitation (SE)
- Learns a joint embedding space via fusion
- Performs robust emotion classification
```
Video → R(2+1)D Encoder ┐
                        ├──► Fusion ──► Classifier (Softmax)
Audio → 1D CNN + SE ────┘
```
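A minimal PyTorch sketch of this design is shown below. The layer sizes, kernel widths, the Kinetics-pretrained backbone, and the default class count are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights


class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation over the channel axis of 1D feature maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, L)
        w = self.fc(x.mean(dim=-1))             # squeeze: global average pool
        return x * w.unsqueeze(-1)              # excite: channel-wise rescale


class AudioVideoFusion(nn.Module):
    """Illustrative audio-video fusion model, not the paper's exact layers."""

    def __init__(self, num_classes: int = 8):
        super().__init__()
        # Video branch: pretrained R(2+1)D with its final FC removed
        self.video = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
        video_feat = self.video.fc.in_features  # 512
        self.video.fc = nn.Identity()
        # Audio branch: 1D CNN + SE over a stacked acoustic feature vector
        self.audio = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            SEBlock1d(64),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),                        # -> (B, 64)
        )
        # Fusion: concatenate the branch embeddings, then classify
        self.classifier = nn.Linear(video_feat + 64, num_classes)

    def forward(self, video, audio):
        # video: (B, 3, T, H, W); audio: (B, 1, feature_dim)
        v = self.video(video)
        a = self.audio(audio)
        return self.classifier(torch.cat([v, a], dim=1))  # softmax at inference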
- **Video Encoder**
  - Pretrained R(2+1)D
  - Captures spatial + temporal dependencies
- **Audio Encoder**
  - Features (see the extraction sketch after this list):
    - MFCC
    - Mel-Spectrogram
    - RMS Energy
    - Zero Crossing Rate
  - Enhanced with Squeeze-and-Excitation (SE)
- **Fusion**
  - Multimodal feature aggregation
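The four acoustic descriptors are each averaged along the temporal axis and stacked into one fixed-length vector. A minimal sketch using librosa (the sample rate, `n_mfcc`, and `n_mels` values are assumptions, not the paper's settings):

```python
import numpy as np
import librosa


def extract_audio_features(path, sr=22050, n_mfcc=40, n_mels=128):
    """Stack the four acoustic descriptors used by the audio branch.

    Each feature map is averaged along the temporal axis, giving one
    fixed-length vector per clip (dimensions here are assumptions).
    """
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    ).mean(axis=1)
    rms = librosa.feature.rms(y=y).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean(axis=1)
    return np.concatenate([mfcc, mel, rms, zcr])  # shape: (n_mfcc + n_mels + 2,)
```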
| Dataset | F1-Score |
|---|---|
| RAVDESS | 0.9711 |
| SAVEE | 0.9965 |
| CREMA-D | 0.8990 |
```bash
git clone https://github.com/S-M-J-I/Multimodal-Emotion-Recognition
cd Multimodal-Emotion-Recognition
```

Install dependencies:

```bash
pip3 install --user pipenv
pipenv install -r requirements.txt
```

Load the model directly using PyTorch Hub:
```python
import torch

model = torch.hub.load(
    "S-M-J-I/Multimodal-SER",
    "savee_model",
    pretrained=True,
    num_classes=7,
    fine_tune_limit=3,
)
```

✔ No need to define the architecture manually
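A quick smoke test of the loaded model might look like the following sketch. The input shapes and the two-argument call signature are assumptions for illustration; check the explore/ notebooks for the exact preprocessing the models expect:

```python
# Hypothetical inference sketch -- shapes and call signature are assumed.
model.eval()
video = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
audio = torch.randn(1, 1, 170)            # stacked acoustic feature vector
with torch.no_grad():
    probs = torch.softmax(model(video, audio), dim=1)
print(probs.argmax(dim=1))                # predicted emotion index
```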
Download datasets:
- RAVDESS
- SAVEE
- CREMA-D
Place them locally and update paths inside notebooks.
⚠️ Important: write dataset paths in the form /path_to_dataset/ and make sure to keep the trailing /.
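A notebook configuration cell might then look like this minimal sketch (the variable name `DATASET_ROOT` is illustrative, not the notebooks' actual identifier):

```python
# Illustrative config cell -- the actual variable names live in the notebooks.
DATASET_ROOT = "/path_to_dataset/"  # note the trailing "/"
assert DATASET_ROOT.endswith("/"), "dataset path must end with a trailing /"
```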
```bash
pipenv run jupyter notebook
```

Navigate to explore/. Each notebook includes:
- Data preprocessing
- Feature extraction
- Training & evaluation
✅ Multimodal fusion of audio + video
✅ Use of R(2+1)D for emotion-aware video encoding
✅ Channel recalibration via Squeeze-and-Excitation
✅ Robust performance across multiple datasets
✅ Lightweight and reproducible pipeline
If you use this work, please cite:
```bibtex
@article{ISLAM2026115762,
  title    = {An audio-video based multi-modal fusion approach for emotion recognition},
  journal  = {Knowledge-Based Systems},
  pages    = {115762},
  year     = {2026},
  issn     = {0950-7051},
  doi      = {10.1016/j.knosys.2026.115762},
  url      = {https://www.sciencedirect.com/science/article/pii/S0950705126004909},
  author   = {S M Jishanul Islam and Sahid Hossain Mustakim and Musfirat Hossain and Mysun Mashira and Nur Islam Shourav and Md Rayhan Ahmed and Salekul Islam and A. K M Muzahidul Islam and Swakkhar Shatabda},
  keywords = {Deep Learning, Multimodal Deep Learning, Convolutional neural network, Squeeze-and-Excitation network, Frame Filtering, Emotion recognition},
  abstract = {Multi-modal deep learning leverages complementary information from multiple data sources, such as audio and video, to improve the robustness of emotion recognition systems. However, effectively integrating heterogeneous features across modalities remains challenging. In this paper, we propose a multi-modal architecture that jointly models audio and visual information for emotion classification. The framework employs a pre-trained R(2+1)D network to extract spatiotemporal representations from video inputs through the combination of 2D spatial and 1D temporal convolutions within a residual learning structure. For the audio modality, multiple acoustic representations—including root-mean-square energy, zero-crossing rate, mel-frequency cepstral coefficients, and mel-spectrogram features averaged along the temporal dimension—are processed using dedicated 1D convolutional networks augmented with a squeeze-and-excitation module for channel-wise feature recalibration. The extracted audio and video features are fused and passed to a Softmax classifier for emotion prediction. The proposed approach is evaluated on three benchmark datasets: the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS), the Surrey Audio–Visual Expressed Emotion (SAVEE), and the Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D). Experimental results demonstrate strong performance, achieving F1-scores of 0.9711, 0.9965, and 0.8990 on RAVDESS, SAVEE, and CREMA-D, respectively, showing competitive performance with recent state-of-the-art methods. The code and trained models are publicly available at https://github.com/S-M-J-I/Multimodal-SER.}
}
```
If you find this repository useful: ⭐ Star this repo · 🍴 Fork it for your research · 🧾 Cite our paper · 🛠️ Open an issue for questions
This repository is not open for external code contributions.
📄 Paper: https://www.sciencedirect.com/science/article/pii/S0950705126004909
💻 Code: https://github.com/S-M-J-I/Multimodal-SER
Built for research. Designed for reproducibility.