Hybrid CNN–Vision Transformer Architecture for Multi-Class Skin Lesion Classification

Project Details

This project investigates the effectiveness of combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for multi-class skin lesion classification using the publicly available HAM10000 dermoscopic image dataset.

CNNs are well-suited for capturing local texture patterns, while Vision Transformers model global spatial relationships through self-attention. To leverage the complementary strengths of both architectures, this project evaluates:

  • a CNN baseline,
  • a ViT baseline,
  • and a hybrid CNN–ViT late-fusion model.

Due to the severe class imbalance present in the dataset, class-weighted cross-entropy loss is applied, and macro-averaged F1-score is used as the primary evaluation metric.
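Because nv dominates the dataset, plain accuracy can look high even when minority classes are consistently missed; macro-F1 instead averages the per-class F1 scores with equal weight. A quick illustrative check with scikit-learn (toy labels, not results from this project):

```python
# Minimal sketch of macro-F1 scoring with scikit-learn.
# The label values here are illustrative, not the project's predictions.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [5, 5, 5, 5, 4, 4, 0, 1]   # class indices, e.g. 5 = nv, 4 = mel
y_pred = [5, 5, 5, 5, 4, 5, 0, 1]

# Macro-F1: compute F1 per class, then take the unweighted mean,
# so every lesion class counts equally regardless of its frequency.
print(f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```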


Objectives

  • Perform multi-class skin lesion classification (7 classes)
  • Compare CNN and ViT architectures under class imbalance
  • Design and evaluate a hybrid CNN–ViT model
  • Analyze performance using Macro-F1 and confusion matrices
  • Ensure full reproducibility using a public dataset

Dataset

HAM10000 – Skin Cancer MNIST

  • 10,015 dermoscopic RGB images
  • 7 diagnostic classes:
    • akiec – Actinic keratoses
    • bcc – Basal cell carcinoma
    • bkl – Benign keratosis
    • df – Dermatofibroma
    • mel – Melanoma
    • nv – Melanocytic nevus
    • vasc – Vascular lesions
  • Source: Kaggle (ISIC Archive)

Only the image files and diagnosis labels (dx) are used; all other metadata fields are ignored. Data split: 70% train / 15% validation / 15% test (stratified by dx).
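A minimal sketch of the stratified split, assuming the Kaggle metadata file HAM10000_metadata.csv with image_id and dx columns is present (the repository's actual loading code may differ):

```python
# Stratified 70/15/15 split on the dx labels (illustrative sketch).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("HAM10000_metadata.csv")[["image_id", "dx"]]

# First carve out 70% for training, then split the remaining 30% in half,
# stratifying on dx each time so all splits keep the class proportions.
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["dx"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["dx"], random_state=42)
```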


Project Architecture

Models

  • CNN Baseline: EfficientNet-B0
  • ViT Baseline: Vision Transformer (ViT-Base / DeiT-Small)
  • Hybrid Model (a minimal sketch follows this list):
    • CNN branch → local texture features
    • ViT branch → global contextual features
    • Late fusion via feature concatenation
    • MLP classifier head
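
One way to realize this late-fusion design, sketched with timm backbones (efficientnet_b0 and deit_small_patch16_224 are assumptions matching the baselines above; the repository's exact head sizes and backbone variants may differ):

```python
# Late-fusion hybrid: concatenate pooled CNN and ViT features, then classify.
import timm
import torch
import torch.nn as nn

class HybridCnnVit(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.cnn = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
        self.vit = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=0)
        fused_dim = self.cnn.num_features + self.vit.num_features  # 1280 + 384 here
        self.head = nn.Sequential(                 # MLP classifier head
            nn.Linear(fused_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)  # late fusion
        return self.head(feats)

logits = HybridCnnVit()(torch.randn(2, 3, 224, 224))  # -> shape (2, 7)
```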

Input

  • Image size: 224 × 224
  • RGB images
  • ImageNet normalization
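
The corresponding preprocessing in torchvision would look roughly like this (the ImageNet mean/std values are standard; the augmentation choice is illustrative, not taken from the repository):

```python
# Resize to 224x224 and apply ImageNet normalization (sketch).
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),   # illustrative augmentation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```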

Class Imbalance Handling

The HAM10000 dataset is highly imbalanced: melanocytic nevi (nv) account for roughly two-thirds of all images, while classes such as df and vasc have fewer than 150 examples each.

To address this:

  • Class-weighted cross-entropy loss is applied (sketched below), so that minority classes contribute more strongly to the training objective
  • Macro-F1 score is used as the primary evaluation metric, since it weights all classes equally
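
A minimal sketch of the weighted loss, assuming the train_df frame from the split sketch above; the weights here are inversely proportional to class frequency (the repository's exact weighting scheme may differ):

```python
# Inverse-frequency class weights for the 7 lesion classes (sketch).
import torch
import torch.nn as nn

classes = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
counts = torch.tensor(
    [(train_df["dx"] == c).sum() for c in classes], dtype=torch.float)

# Rare classes get large weights, frequent classes small ones.
weights = counts.sum() / (len(classes) * counts)

criterion = nn.CrossEntropyLoss(weight=weights)  # used in the training loop
```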

My Findings

  • CNNs → perform better on medium-sized medical datasets such as HAM10000
  • ViTs → need more data or longer training to catch up

