Hybrid CNN–Vision Transformer Architecture for Multi-Class Skin Lesion Classification

Project Details

This project investigates the effectiveness of combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for multi-class skin lesion classification using the publicly available HAM10000 dermoscopic image dataset.

CNNs are well-suited for capturing local texture patterns, while Vision Transformers model global spatial relationships through self-attention. To leverage the complementary strengths of both architectures, this project evaluates:

  • a CNN baseline,
  • a ViT baseline,
  • and a hybrid CNN–ViT late-fusion model.

Due to the severe class imbalance present in the dataset, class-weighted cross-entropy loss is applied, and macro-averaged F1-score is used as the primary evaluation metric.
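Because nv dominates the dataset, plain accuracy can look high even when minority classes are consistently missed; macro-F1 instead averages the per-class F1 scores with equal weight. A quick illustrative check with scikit-learn (toy labels, not results from this project):

```python
# Minimal sketch of macro-F1 scoring with scikit-learn.
# The label values here are illustrative, not the project's predictions.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [5, 5, 5, 5, 4, 4, 0, 1]   # class indices, e.g. 5 = nv, 4 = mel
y_pred = [5, 5, 5, 5, 4, 5, 0, 1]

# Macro-F1: compute F1 per class, then take the unweighted mean,
# so every lesion class counts equally regardless of its frequency.
print(f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```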


Objectives

  • Perform multi-class skin lesion classification (7 classes)
  • Compare CNN and ViT architectures under class imbalance
  • Design and evaluate a hybrid CNN–ViT model
  • Analyze performance using Macro-F1 and confusion matrices
  • Ensure full reproducibility using a public dataset

Dataset

HAM10000 – Skin Cancer MNIST

  • 10,015 dermoscopic RGB images
  • 7 diagnostic classes:
    • akiec – Actinic keratoses
    • bcc – Basal cell carcinoma
    • bkl – Benign keratosis
    • df – Dermatofibroma
    • mel – Melanoma
    • nv – Melanocytic nevus
    • vasc – Vascular lesions
  • Source: Kaggle (ISIC Archive)

Only the image files and diagnosis labels (dx) are used; all other metadata fields are ignored. Data split: 70% train / 15% validation / 15% test (stratified by dx).
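A minimal sketch of the stratified split, assuming the Kaggle metadata file HAM10000_metadata.csv with image_id and dx columns is present (the repository's actual loading code may differ):

```python
# Stratified 70/15/15 split on the dx labels (illustrative sketch).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("HAM10000_metadata.csv")[["image_id", "dx"]]

# First carve out 70% for training, then split the remaining 30% in half,
# stratifying on dx each time so all splits keep the class proportions.
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["dx"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["dx"], random_state=42)
```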


Project Architecture

Models

  • CNN Baseline: EfficientNet-B0
  • ViT Baseline: Vision Transformer (ViT-Base / DeiT-Small)
  • Hybrid Model (a minimal sketch follows this list):
    • CNN branch → local texture features
    • ViT branch → global contextual features
    • Late fusion via feature concatenation
    • MLP classifier head
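
One way to realize this late-fusion design, sketched with timm backbones (efficientnet_b0 and deit_small_patch16_224 are assumptions matching the baselines above; the repository's exact head sizes and backbone variants may differ):

```python
# Late-fusion hybrid: concatenate pooled CNN and ViT features, then classify.
import timm
import torch
import torch.nn as nn

class HybridCnnVit(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.cnn = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
        self.vit = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=0)
        fused_dim = self.cnn.num_features + self.vit.num_features  # 1280 + 384 here
        self.head = nn.Sequential(                 # MLP classifier head
            nn.Linear(fused_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)  # late fusion
        return self.head(feats)

logits = HybridCnnVit()(torch.randn(2, 3, 224, 224))  # -> shape (2, 7)
```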

Input

  • Image size: 224 × 224
  • RGB images
  • ImageNet normalization
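
The corresponding preprocessing in torchvision would look roughly like this (the ImageNet mean/std values are standard; the augmentation choice is illustrative, not taken from the repository):

```python
# Resize to 224x224 and apply ImageNet normalization (sketch).
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),   # illustrative augmentation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```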

Class Imbalance Handling

The HAM10000 dataset is highly imbalanced: melanocytic nevi (nv) account for roughly two-thirds of all images, while classes such as df and vasc have fewer than 150 examples each.

To address this:

  • Class-weighted cross-entropy loss is applied (sketched below), so that minority classes contribute more strongly to the training objective
  • Macro-F1 score is used as the primary evaluation metric, since it weights all classes equally
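
A minimal sketch of the weighted loss, assuming the train_df frame from the split sketch above; the weights here are inversely proportional to class frequency (the repository's exact weighting scheme may differ):

```python
# Inverse-frequency class weights for the 7 lesion classes (sketch).
import torch
import torch.nn as nn

classes = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
counts = torch.tensor(
    [(train_df["dx"] == c).sum() for c in classes], dtype=torch.float)

# Rare classes get large weights, frequent classes small ones.
weights = counts.sum() / (len(classes) * counts)

criterion = nn.CrossEntropyLoss(weight=weights)  # used in the training loop
```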

My Findings

  • CNNs → perform better on medium-sized medical datasets such as HAM10000
  • ViTs → need more data or longer training to catch up

