This project investigates the effectiveness of combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for multi-class skin lesion classification using the publicly available HAM10000 dermoscopic image dataset.
CNNs are well-suited for capturing local texture patterns, while Vision Transformers model global spatial relationships through self-attention. To leverage the complementary strengths of both architectures, this project evaluates:
- a CNN baseline,
- a ViT baseline,
- and a hybrid CNN–ViT late-fusion model.
Due to the severe class imbalance present in the dataset, class-weighted cross-entropy loss is applied, and macro-averaged F1-score is used as the primary evaluation metric.
The project's objectives are:
- Perform multi-class skin lesion classification (7 classes)
- Compare CNN and ViT architectures under class imbalance
- Design and evaluate a hybrid CNN–ViT model
- Analyze performance using Macro-F1 and confusion matrices
- Ensure full reproducibility using a public dataset (a seeding sketch follows this list)
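Beyond relying on a public dataset, reproducible runs also require fixing random seeds. A minimal sketch, assuming PyTorch as the training framework (the helper name and seed value are arbitrary, not project APIs):

```python
# Minimal sketch: fix the common RNG seeds for reproducible runs.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                  # Python RNG
    np.random.seed(seed)               # NumPy RNG
    torch.manual_seed(seed)            # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)   # all GPU RNGs
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
```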
Dataset: HAM10000 – Skin Cancer MNIST
- 10,015 dermoscopic RGB images
- 7 diagnostic classes:
  - akiec – Actinic keratoses
  - bcc – Basal cell carcinoma
  - bkl – Benign keratosis
  - df – Dermatofibroma
  - mel – Melanoma
  - nv – Melanocytic nevus
  - vasc – Vascular lesions
- Source: Kaggle (ISIC Archive)
Only image files and diagnosis labels (`dx`) are used. All other metadata fields are ignored.

Data split: 70% train / 15% validation / 15% test (stratified)
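For illustration, the stratified split could be produced with scikit-learn; the metadata file name and `dx` column follow the standard Kaggle release, but the path is an assumption:

```python
# Minimal sketch of the stratified 70/15/15 split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("HAM10000_metadata.csv")  # adjust to your local copy

# 70% train, stratified on the diagnosis label.
train_df, rest_df = train_test_split(
    df, train_size=0.70, stratify=df["dx"], random_state=42
)
# Split the remaining 30% evenly: 15% validation, 15% test.
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["dx"], random_state=42
)
print(len(train_df), len(val_df), len(test_df))
```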
Models evaluated:
- CNN Baseline: EfficientNet-B0
- ViT Baseline: Vision Transformer (ViT-Base / DeiT-Small)
- Hybrid Model (a code sketch follows this list):
- CNN branch → local texture features
- ViT branch → global contextual features
- Late fusion via feature concatenation
- MLP classifier head
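A minimal sketch of the hybrid late-fusion model, assuming `timm` backbones; the backbone names, hidden width, and dropout rate are illustrative choices rather than fixed project settings:

```python
# Late fusion: concatenate pooled CNN and ViT features, classify with an MLP.
import torch
import torch.nn as nn
import timm

class HybridCnnVit(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # CNN branch: local texture features (num_classes=0 strips the head,
        # so the forward pass returns pooled features).
        self.cnn = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
        # ViT branch: global contextual features via self-attention.
        self.vit = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=0)
        fused_dim = self.cnn.num_features + self.vit.num_features
        # MLP classifier head over the concatenated (late-fused) features.
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)  # late fusion
        return self.head(feats)

model = HybridCnnVit()
logits = model(torch.randn(2, 3, 224, 224))  # shape: (2, 7)
```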
Preprocessing:
- Image size: 224 × 224
- RGB images
- ImageNet normalization
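These steps correspond to a standard torchvision pipeline (training-time augmentations are omitted here for brevity):

```python
# Resize to 224 × 224, convert to tensor, normalize with ImageNet statistics.
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # dermoscopic image -> 224 × 224
    transforms.ToTensor(),           # RGB image -> float tensor in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```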
The HAM10000 dataset is highly imbalanced: the majority class (nv) accounts for roughly two-thirds of all images, while the rarest classes (df and vasc) have fewer than 150 images each.
To address this:
- Class-weighted cross-entropy loss is applied
- Macro-F1 score is used as the primary evaluation metric
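A sketch of both pieces; the inverse-frequency weighting shown here is one common scheme, not necessarily the project's exact formula:

```python
# Class-weighted cross-entropy plus macro-F1 evaluation.
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

NUM_CLASSES = 7
# Toy stand-in for the training labels (integers 0..6).
train_labels = np.array([5, 5, 5, 5, 5, 5, 4, 4, 2, 1, 0, 3, 6])

# weight_c = N / (C * n_c): rarer classes contribute more to the loss.
counts = np.bincount(train_labels, minlength=NUM_CLASSES)
weights = counts.sum() / (NUM_CLASSES * np.maximum(counts, 1))
criterion = nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32)
)

# Macro-F1 averages per-class F1 scores, so every class counts equally
# regardless of its frequency.
y_true = [0, 1, 2, 3, 4, 5, 5, 6]
y_pred = [0, 1, 2, 3, 5, 5, 5, 6]
print(f1_score(y_true, y_pred, average="macro"))
```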
A working assumption behind this comparison: CNNs tend to perform better on medium-sized medical imaging datasets, while ViTs typically need more data or longer training to reach comparable performance.