This project focuses on classifying facial emotions using a fine-tuned Vision Transformer (ViT) model.
We used the ViT-Base-Patch16-224 pretrained model and adapted its classification head to predict seven emotions from grayscale face images.
The work was done as part of a team project. This repository contains my contribution: the ViT model, including data augmentation, model training, and evaluation.
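As an illustration, replacing the classification head for the seven FER2013 emotions can look like this with Hugging Face Transformers (a minimal sketch, not the exact project code; the label order is an assumption):

```python
# Minimal sketch: load the pretrained ViT backbone and attach a fresh
# 7-class classification head for the FER2013 emotions (label order assumed).
from transformers import ViTForImageClassification, ViTImageProcessor

EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(EMOTIONS),
    id2label={i: name for i, name in enumerate(EMOTIONS)},
    label2id={name: i for i, name in enumerate(EMOTIONS)},
    ignore_mismatched_sizes=True,  # discard the original 1000-class ImageNet head
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
```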
We used the FER2013 dataset from Kaggle, which contains 35,685 grayscale images (48×48 pixels) categorized into seven emotions:
- 😠 angry
- 🤢 disgusted
- 😨 fearful
- 😀 happy
- 😐 neutral
- 😢 sad
- 😲 surprised
The dataset is split into training and test sets. The disgusted class was underrepresented and required augmentation (see below).
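Since ViT-Base-Patch16-224 expects 224×224 RGB input, the 48×48 grayscale images have to be resized and expanded to three channels before they reach the model. A minimal preprocessing sketch (normalization values assumed from the checkpoint defaults):

```python
# Sketch of the FER2013 → ViT input pipeline (assumed values):
# replicate the single grayscale channel to RGB and upscale 48×48 → 224×224.
from torchvision import transforms

vit_input = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # 1 channel → 3 channels
    transforms.Resize((224, 224)),                # 48×48 → 224×224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],    # ViT-Base-Patch16-224 default normalization
                         std=[0.5, 0.5, 0.5]),
])
```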
- Imbalanced Classes: e.g., `disgusted` was underrepresented → resolved via targeted augmentation (rotation, flips; see the sketch after this list)
- Low Image Quality: Some images were blurry or contained text artifacts
- Similar Emotions: Even for humans, emotions like fearful and sad are hard to distinguish
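The targeted augmentation of the `disgusted` class could be done roughly like this (a sketch; the exact parameters and number of extra copies are assumptions):

```python
# Sketch: generate extra samples for the underrepresented `disgusted` class
# with random flips and small rotations (parameters assumed).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),  # small rotations keep faces upright
])

# Applied only to the minority-class images, e.g. (hypothetical names):
# extra = [augment(img) for img in disgusted_images for _ in range(n_copies)]
```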
- ViT Model: Google ViT-Base-Patch16-224
- Libraries: PyTorch · Hugging Face Transformers
- Training Techniques (a minimal setup sketch follows this list):
  - Data Augmentation (flips, rotation)
  - Weight Decay Regularization
  - Early Stopping · Learning Rate Scheduler
  - Cross-Entropy Loss
  - Self-Attention Mechanism
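For illustration, a minimal training setup combining weight decay, a learning-rate scheduler, cross-entropy loss, and early stopping could look as follows (hyperparameters and the `evaluate`/`val_loader` helpers are assumptions, not the project's exact code):

```python
# Sketch of the training setup (assumed hyperparameters):
# AdamW with weight decay, a step LR scheduler, cross-entropy loss,
# and early stopping on the validation loss.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
criterion = torch.nn.CrossEntropyLoss()  # used inside the elided training step

best_val_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(20):
    # ... train for one epoch on the (augmented) training set ...
    val_loss = evaluate(model, val_loader)  # hypothetical validation helper
    scheduler.step()
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_vit_fer2013.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping
```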
This matrix shows the model’s performance on the test set (normalized):
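Such a row-normalized confusion matrix can be reproduced from the test-set predictions, for example with scikit-learn (a sketch; `y_true`, `y_pred`, and `EMOTIONS` are assumed to come from the evaluation step):

```python
# Sketch: plot a confusion matrix normalized over the true labels (rows sum to 1).
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred,            # test-set labels and model predictions (assumed)
    display_labels=EMOTIONS,   # the seven emotion names
    normalize="true",          # normalize each row by the true-class count
    xticks_rotation=45,
)
```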
The final poster summarizes all three models (CNN, Transfer Learning & ViT), their performance, challenges, and findings:
This repository was created as part of a university project at FHNW and is intended for demonstration and educational purposes only.
You can view and download the original data here:
➡️ Kaggle – Facial Emotion Recognition (FER2013)

