The goal of this project was to build a 3D Convolutional Neural Network (CNN) that classifies exercises performed with an elastic ribbon from video data, using only binary masks as input. The dataset was provided by AGH University and consisted of anonymized video recordings of individuals performing various exercises.
The project was carried out by a team of three members.
In the project we compared three different approaches to the problem:
- Detection based only on the binary mask of the person performing the exercise
- Detection based on the binary mask of the ribbon
- Detection based on the binary mask of both the person and the ribbon
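The three variants differ only in which masks are fed to the network. A minimal sketch of how the input for each variant could be assembled is shown below; the function name, the `mode` values, and the 64x64 frame size are illustrative assumptions, not names taken from the project code.

```python
import torch


def build_clip_input(person_masks, ribbon_masks, mode="both"):
    """Stack per-frame binary masks into a (C, T, H, W) float tensor.

    person_masks, ribbon_masks: tensors of shape (T, H, W) holding 0/1 values.
    mode: "person", "ribbon" or "both" (hypothetical names for the variants).
    """
    if mode == "person":
        channels = [person_masks]
    elif mode == "ribbon":
        channels = [ribbon_masks]
    elif mode == "both":
        channels = [person_masks, ribbon_masks]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Channel axis first, as expected by 3D convolutions in PyTorch.
    return torch.stack(channels, dim=0).float()


# Dummy masks for a 7-frame clip (frame size is an assumption).
person = torch.zeros(7, 64, 64)
ribbon = torch.ones(7, 64, 64)
both_input = build_clip_input(person, ribbon, "both")
person_input = build_clip_input(person, ribbon, "person")
```

With this layout the single-mask variants yield one channel and the combined variant yields two, so the same network definition can serve all three settings.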
The dataset contained video recordings of 44 individuals performing 16 different exercises. Each video frame was labeled with the corresponding exercise type: frames with no exercise were labeled 0, while the exercises were labeled from 1 to 16.
The video data was preprocessed by extracting frames and grouping them into short clips. Because of the high computational
cost of training, each clip contains only seven frames, sampled at a regular interval (frame_skip) of five frames,
and clips are generated with a stride of 5 so that consecutive clips overlap only slightly.
The samples were generated using a generate_samples.py script.
This resulted in about 12,000 clips, split 80%/20% into training and validation sets.
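The clip sampling described above can be sketched as follows. This is not the code of generate_samples.py; the function name is hypothetical, and the sketch assumes the stride is counted in sampled frames (after applying frame_skip), which gives a two-frame overlap between consecutive seven-frame clips.

```python
def clip_frame_indices(num_frames, clip_len=7, frame_skip=5, stride=5):
    """Return a list of clips, each a list of frame indices into the video.

    Assumption: the video is first subsampled to every frame_skip-th frame,
    then a window of clip_len sampled frames slides forward by `stride`
    sampled frames.
    """
    sampled = list(range(0, num_frames, frame_skip))
    clips = []
    for start in range(0, len(sampled) - clip_len + 1, stride):
        clips.append(sampled[start:start + clip_len])
    return clips


clips = clip_frame_indices(100)
for clip in clips:
    print(clip)
```

Under these assumptions a 100-frame video yields three clips, and each clip shares its last two sampled frames with the next one.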
During sample generation, we also produced binary masks for the person and the ribbon. The person was detected using a pre-trained Yolo26s model, while the ribbon was detected using a fine-tuned version of the same model.
Disclaimer: This step could have been done with a single model; however, due to lack of time, we decided to use two separate models.
The ribbon detection model was fine-tuned on a small dataset of around 300 frames labeled manually in Label Studio.
The resulting model segmented the ribbon with decent accuracy, although it was not perfect. Sometimes the ribbon was detected only partially, sometimes it was not detected at all. The person detection model performed well on the given data.
The model used for classification was a 3D Convolutional Neural Network (CNN) implemented in PyTorch. The architecture consists of three convolutional layers followed by a fully connected layer and a softmax output layer. The model was trained using the Adam optimizer and the cross-entropy loss function.
Depending on the settings, the model consisted of either one or two input channels, corresponding to the binary masks of the person and the ribbon.
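A network of this shape might look roughly like the sketch below. The class name, channel widths, kernel sizes, pooling layers, and input resolution are assumptions made for illustration; only the three convolutional layers, the fully connected layer, the one-or-two input channels, and the 17 classes (labels 0 to 16) come from the description above. The module returns logits, since PyTorch's cross-entropy loss applies the softmax internally; an explicit softmax is applied only to obtain probabilities.

```python
import torch
import torch.nn as nn


class MaskClip3DCNN(nn.Module):
    """Illustrative 3D CNN over mask clips of shape (N, C, T, H, W)."""

    def __init__(self, in_channels=2, num_classes=17, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global average over T, H, W
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        # Returns raw logits, one per class.
        return self.classifier(self.features(x))


model = MaskClip3DCNN(in_channels=2)      # use in_channels=1 for a single mask
batch = torch.rand(4, 2, 7, 64, 64)       # 4 clips, 2 masks, 7 frames each
logits = model(batch)
probs = torch.softmax(logits, dim=1)      # class probabilities per clip
```

Switching between the person-only, ribbon-only, and combined settings then only requires changing `in_channels`.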
The model was trained for 50 epochs with a batch size of 32 on each of the three approaches (person mask, ribbon mask, and combined masks). The training process was monitored using Weights & Biases (W&B) to track the loss and accuracy metrics.
The results showed that even a simple model like this achieved decent performance, with an F1-score of around 0.8 in all three settings.
Even though the ribbon detection model was not perfect, the best performance was achieved when using the combined masks of both the person and the ribbon. This suggests that both masks contain meaningful information for the exercise detection task.
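For reference, a multi-class F1-score can be computed as in the sketch below. The text does not say which averaging was used, so the macro average here is an assumption, and the labels in the example are made up purely to exercise the function.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    A class with no true and no predicted samples contributes 0.
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels, not project data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
```

Macro averaging gives every class the same weight, which matters when one class (here class 0) dominates the dataset.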
On the other hand, because the samples in the dataset were relatively similar and the classes were imbalanced (class 0 was overrepresented), the model was prone to overfitting. This was partially mitigated by using a high dropout rate (0.5), L2 weight regularization, and class weighting in the cross-entropy loss function. Data augmentation could improve this further.
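The regularization measures listed above could be wired together in PyTorch roughly as follows. The inverse-frequency weighting scheme, the learning rate, the weight decay value, and the stand-in model are all assumptions for illustration; the text only states that class weighting, L2 regularization, and a dropout rate of 0.5 were used.

```python
from collections import Counter

import torch
import torch.nn as nn


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count); 0 if absent."""
    counts = Counter(labels)
    total = len(labels)
    weights = [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]
    return torch.tensor(weights, dtype=torch.float32)


# Toy labels in which class 0 is overrepresented (not project data).
labels = [0] * 6 + [1] * 2 + [2] * 2
weights = inverse_frequency_weights(labels, num_classes=3)

# Class-weighted cross-entropy: rare classes contribute more to the loss.
criterion = nn.CrossEntropyLoss(weight=weights)

# Stand-in classifier with dropout 0.5; the real model is a 3D CNN.
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(8, 3))

# In torch.optim.Adam, weight_decay adds an L2 penalty to the gradients.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# One illustrative training step on random features.
logits = model(torch.rand(len(labels), 8))
loss = criterion(logits, torch.tensor(labels))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With these toy labels the overrepresented class 0 receives the smallest weight, so its errors are penalized less than those of the rarer classes.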

