Polyp Classification with PIBAdb Images

This project contains a process for training polyp classification models using images from the PIBAdb database.
The pipeline performs the following main tasks:

  • Generation of polyp-level stratified k-fold partitions (see the sketch after this list)
  • Optional inclusion of localized images (automatically detected frames containing the same polyp, cropped using the bounding box metadata) to increase the training data
  • Oversampling of minority classes in the training split
  • Training of ImageNet-pretrained CNN models using GluonCV
  • Automatic generation of metrics, plots, and HTML reports
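
The partition generation can be illustrated with scikit-learn's StratifiedGroupKFold, which stratifies by class while keeping all images of the same polyp (same polyp_id) in a single partition. This is a minimal sketch of the idea, not the pipeline's actual implementation (the metadata fields are described in Section 1):

import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

# Per-image metadata; fields are described in Section 1.
metadata = pd.read_csv("experiments/datasets/cropped_polyps_dataset/metadata.csv")

# Stratify on the histological class while grouping by polyp, so that
# images of one polyp never end up in different partitions.
skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = skf.split(metadata, y=metadata["condition"], groups=metadata["polyp_id"])
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    train, test = metadata.iloc[train_idx], metadata.iloc[test_idx]
    print(f"Fold {fold}: {train['polyp_id'].nunique()} train polyps, "
          f"{test['polyp_id'].nunique()} test polyps")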

1. PIBAdb datasets

To use the pipeline, you must request access to the PIBAdb Cohort through the Biobank of the Instituto de Investigaciones Sanitarias Galicia Sur. Images for polyp classification must be cropped according to the bounding boxes in metadata.csv and placed in the experiments/datasets folder.

The required fields in the metadata.csv file are (a cropping sketch follows the list):

  • polyp_id: Unique identifier of the polyp
  • image_id: Identifier of the source image
  • x, y: Top-left corner of the bounding box
  • width, height: Dimensions of the bounding box
  • light: Illumination type (WL = White Light, NBI = Narrow-Band Imaging)
  • condition: Histological type of the polyp (e.g., adenoma, hyperplastic, sessile serrated adenoma, etc.)
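
As an illustration of the required cropping, the following sketch reads metadata.csv and extracts each bounding box with Pillow. The source-frame folder, file extension, and output naming scheme are assumptions for the example, not conventions of this repository:

import pandas as pd
from pathlib import Path
from PIL import Image

SRC = Path("full_frames")  # hypothetical folder holding the original frames
DST = Path("experiments/datasets/cropped_polyps_dataset/images")
DST.mkdir(parents=True, exist_ok=True)

metadata = pd.read_csv("metadata.csv")
for row in metadata.itertuples():
    # Crop the (x, y, width, height) bounding box from the source frame.
    with Image.open(SRC / f"{row.image_id}.png") as frame:
        box = (int(row.x), int(row.y),
               int(row.x + row.width), int(row.y + row.height))
        frame.crop(box).save(DST / f"{row.polyp_id}_{row.image_id}.png")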

2. Thesis Experiments

Based on the doctoral thesis of Nogueira-Rodríguez (2022):

Nogueira-Rodríguez, A. Deep Learning Techniques for Computer-Aided Diagnosis in Colorectal Cancer. PhD Thesis (supervisors: D. González-Peña and H. López-Fernández), University of Vigo, 2022.

2.1 Classification classes

  • Adenoma = Adenoma, Sessile Serrated Adenoma (SSA), Traditional Serrated Adenoma (TSA)
  • Hyperplastic = Hyperplastic

Images from other histological categories were excluded (Non-Epithelial Neoplastic, Invasive, No Histology).
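
A minimal sketch of this class mapping; the exact condition strings stored in metadata.csv are assumptions here:

# Thesis class mapping: adenoma, SSA, and TSA collapse into "adenoma";
# all other histological categories are excluded (mapped to None).
CLASS_MAP = {
    "adenoma": "adenoma",
    "sessile serrated adenoma": "adenoma",
    "traditional serrated adenoma": "adenoma",
    "hyperplastic": "hyperplastic",
}

def to_label(condition):
    """Map a metadata.csv condition value to a training class, or None."""
    return CLASS_MAP.get(condition.strip().lower())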

2.2 Experiment Datasets

The datasets differ in how the images were collected and whether additional automatically detected frames are included.

Dataset  Description                                              #Polyps
D1       Manual selection from full withdrawal videos                 358
D2       D1 + automatically localized images (training only)          358
D3       Manual selection of large/close polyps from new videos
         focused on the classification task                           129
D4       Combination of D1 + D3                                       487
D5       Combination of D2 + D3                                       487

Note: The dataset names D1, D2, D3, D4, and D5 are labels used in the thesis for clarity and do not correspond to actual folders or dataset identifiers in this repository. To reproduce the experiments, you must prepare your own dataset folders inside experiments/datasets/ and reference them through the dataset_name parameter in pipeline.params.

2.3 CNN Architectures

All models are pretrained on ImageNet (see the loading sketch after this list):

  • ResNet50
  • VGG19
  • InceptionV3
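
As an illustration, such backbones can be fetched with ImageNet weights from the GluonCV model zoo and given a two-class output layer, following GluonCV's standard fine-tuning pattern; this is not the repository's training code:

from mxnet.gluon import nn
from gluoncv.model_zoo import get_model

# GluonCV model zoo names for the three architectures.
for name in ("resnet50_v1", "vgg19", "inceptionv3"):
    net = get_model(name, pretrained=True)  # ImageNet weights
    with net.name_scope():
        net.output = nn.Dense(2)            # adenoma vs. hyperplastic head
    net.output.initialize()
    print(name, "loaded and ready for fine-tuning")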

2.4 Training Strategy

The training process consists of:

  • 150 epochs
  • Optional oversampling of minority classes in the training partition (sketched after this list)
  • 5-fold stratified cross-validation at the polyp level (images from the same polyp cannot appear in different partitions)
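
The oversampling enabled by with_balanced_train can be sketched as random duplication of minority-class rows until all classes reach the majority count; this illustrates the concept only and is not the pipeline's implementation:

import pandas as pd

def oversample(train, label_col="condition", seed=42):
    """Randomly duplicate minority-class rows until every class reaches
    the majority-class count, then shuffle. Training split only."""
    target = train[label_col].value_counts().max()
    parts = []
    for _, group in train.groupby(label_col):
        parts.append(group)
        if len(group) < target:
            parts.append(group.sample(n=target - len(group),
                                      replace=True, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)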

3. Environment Setup

3.1 Requirements

  • Docker
  • Access to PIBAdb datasets (via request to the PIBAdb Cohort)

3.2 Build the Docker image

The pipeline runs inside a Docker container to ensure reproducibility. To build the Docker image:

docker build -t polydeep/classification .

4. Project Structure

It is recommended to organize datasets and experiments as follows:

/classification-project
│
├── experiments/
│   ├── datasets/
│   │   └── cropped_polyps_dataset/
│   │       ├── images/
│   │       ├── metadata.csv
│   │       └── polyp-metadata.csv
│   ├── Exp_K1_CNN/
│   │   ├── pipeline.params
│   │   └── ...
│   ├── Exp_K2_CNN/
│   │   ├── pipeline.params
│   │   └── ...
│   └── ...
└── ...

Note: The folder experiments/datasets/ contains all datasets used by the different experiments.

Note: Each experiment folder (Exp_K1_CNN, Exp_K2_CNN, etc.) contains a pipeline.params file specifying its configuration. Example:

dataset_name=cropped_polyps_dataset
model_name=vgg19
num_gpus=1
epochs=150
kfolds=5
fold=1
with_balanced_train
batch_size=96
learning_rate=0.00001
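
Compi consumes this file directly, but for inspecting configurations outside the container the format can be read with a few lines of Python; this is a sketch based only on the example above (key=value entries plus bare flag lines):

def read_params(path):
    """Parse a pipeline.params file: 'key=value' lines become entries and
    bare lines (e.g. with_balanced_train) become boolean flags."""
    params = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if "=" in line:
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip()
            else:
                params[line] = True
    return params

print(read_params("experiments/Exp_K1_CNN/pipeline.params"))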

5. Basic Pipeline Execution

To run an experiment:

docker run --rm \
    -v /local/path/to/experiment:/experiment \
    polydeep/classification \
    -p /pipeline.xml \
    --params /experiment/pipeline.params

Where:

  • -v /local/path/to/experiment:/experiment mounts the local experiment folder inside the container.

  • -p /pipeline.xml specifies the pipeline to run (already included in the container).

  • --params /experiment/pipeline.params provides the experiment configuration.

Optional flags in pipeline.params

Flag                   Description
with_balanced_train    Enables oversampling in the training set
with_localized_images  Adds automatically detected images of the same polyp (localized images) from another dataset to the training set only
with_merge_ttv         Merges train/val partitions from two prior experiments

6. Running the Thesis Experiments

The experiments described in the thesis fall into three categories, depending on dataset composition and training strategy. This section describes how to reproduce each one from this repository using the Compi pipeline.

6.1 Standard Dataset (No Additional Images)

Applies to:

  • D1 – PIBAdb Manually Selected NBI
  • D3 – PIBAdb Manually Selected for Classification NBI

Example configuration

In the experiment folder (e.g., Exp_K1_ResNet50/), the pipeline.params file must specify the dataset name. For example, if the dataset folders are named D1 and D3:

dataset_name=D1     # or D3

Run with:

docker run --rm \
    -v /local/path/to/experiment:/experiment \
    polydeep/classification \
    -p /pipeline.xml \
    --params /experiment/pipeline.params

6.2 Experiments Augmented with Additional Images During Training

These experiments take additional images of the same polyps from a secondary dataset (e.g., “PIBAdb Automatically Selected NBI”) and add them to the training split only.

Applicable to:

  • D2 – D1 + localized images (training only)

Example configuration

dataset_name=D2
with_localized_images

Run with:

docker run --rm \
    -v /local/path/to/dataset:/datasets-localized \
    -v /local/path/to/experiment:/experiment \
    polydeep/classification \
    -p /pipeline.xml \
    --params /experiment/pipeline.params

6.3 Combined Datasets

These experiments require merging partitions from two distinct preliminary experiments.

Steps:

  1. Create two baseline experiments. For example:
    • Exp_K1_D1_ResNet50
    • Exp_K1_D3_ResNet50
  2. Enable with_merge_ttv to merge the train and val partitions before training.

Example configuration

exp1=Exp_K1_D1_ResNet50
exp2=Exp_K1_D3_ResNet50
model_name=resnet50
epochs=150
kfolds=5
fold=1
with_merge_ttv
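
Conceptually, with_merge_ttv concatenates the train and validation partitions of the two source experiments before training. A rough sketch, assuming (hypothetically) that each split stores its image list as ttv/<split>/images.txt:

from pathlib import Path

def merge_split(exp1, exp2, out, split):
    """Concatenate the image list of one split from two experiments.
    The ttv/<split>/images.txt layout is a hypothetical file name."""
    lines = []
    for exp in (exp1, exp2):
        lines += Path(exp, "ttv", split, "images.txt").read_text().splitlines()
    dst = Path(out, "ttv", split)
    dst.mkdir(parents=True, exist_ok=True)
    (dst / "images.txt").write_text("\n".join(lines) + "\n")

for split in ("train", "validation"):
    merge_split("Exp_K1_D1_ResNet50", "Exp_K1_D3_ResNet50",
                "Exp_K1_D1_D3_ResNet50", split)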

7. Generated Outputs

After execution, the experiment directory contains:

  • results/: Training and validation metrics (CSV format)

  • summaries/: Performance and loss plots (PNG format)

  • index.html: Summary report with all metrics and plots

  • ttv/: train/ and validation/ folders with metadata and image list

Note: Because the experiment folder is mounted into the container, all outputs are written directly to the local experiment folder.
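
For a quick look at the metrics without opening index.html, the CSV files under results/ can be loaded with pandas. The column layout depends on the pipeline, so this sketch only prints each file's shape and last row:

import pandas as pd
from pathlib import Path

# Print a quick summary of every metrics CSV produced under results/.
for csv in sorted(Path("/local/path/to/experiment/results").glob("*.csv")):
    df = pd.read_csv(csv)
    print(csv.name, df.shape)
    print(df.tail(1).to_string(index=False))  # last row (e.g., final epoch)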
