This project contains a process for training polyp classification models using images from the PIBAdb database.
The pipeline performs the following main tasks:
- Generation of polyp-level stratified k-fold partitions
- Optional inclusion of localized images of the same polyp to increase training data (automatically detected frames containing the same polyp, cropped using the bounding box metadata)
- Oversampling of minority classes in the training split
- Training of ImageNet-pretrained CNN models using GluonCV
- Automatic generation of metrics, plots, and HTML reports
To use the pipeline, you must request access to the PIBAdb Cohort through the Biobank of the Instituto de Investigación Sanitaria Galicia Sur. Images for polyp classification must be cropped according to the bounding boxes in metadata.csv and placed in the experiments/datasets folder (a cropping sketch follows the field list below).
The required fields in the metadata.csv file are:
- `polyp_id`: Unique identifier of the polyp
- `image_id`: Identifier of the source image
- `x`, `y`: Top-left corner of the bounding box
- `width`, `height`: Dimensions of the bounding box
- `light`: Illumination type (WL = White Light, NBI = Narrow-Band Imaging)
- `condition`: Histological type of the polyp (e.g., adenoma, hyperplastic, sessile serrated adenoma, etc.)
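The cropping step is performed outside the pipeline. A minimal sketch of how it could be done with pandas and Pillow, assuming the full frames are stored as `<image_id>.png` under a `source_images/` folder (both the folder and the file naming are assumptions, not part of this repository):

```python
# crop_polyps.py - illustrative sketch, not part of the pipeline.
import pandas as pd
from PIL import Image
from pathlib import Path

SRC = Path("source_images")  # assumed location of the full frames
DST = Path("experiments/datasets/cropped_polyps_dataset/images")
DST.mkdir(parents=True, exist_ok=True)

metadata = pd.read_csv("metadata.csv")
for row in metadata.itertuples():
    frame = Image.open(SRC / f"{row.image_id}.png")
    # Bounding box: top-left corner (x, y) plus width/height.
    box = (row.x, row.y, row.x + row.width, row.y + row.height)
    frame.crop(box).save(DST / f"{row.polyp_id}_{row.image_id}.png")
```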
Based on Nogueira-Rodríguez (2022):

Nogueira-Rodríguez, A. *Deep Learning Techniques for Computer-Aided Diagnosis in Colorectal Cancer*. Ph.D. Thesis (supervisors: D. González-Peña, H. López-Fernández), University of Vigo, 2022.
- Adenoma = Adenoma, Sessile Serrated Adenoma (SSA), Traditional Serrated Adenoma (TSA)
- Hyperplastic = Hyperplastic
Images from other histological categories were excluded (Non-Epithelial Neoplastic, Invasive, No Histology).
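For reference, this two-class grouping can be expressed as a simple lookup. The exact condition strings below are illustrative and must match the values actually used in metadata.csv:

```python
# Illustrative mapping from the `condition` field to the two training classes.
CONDITION_TO_CLASS = {
    "adenoma": "adenoma",
    "sessile serrated adenoma": "adenoma",
    "traditional serrated adenoma": "adenoma",
    "hyperplastic": "hyperplastic",
}
# Conditions absent from the mapping (e.g., invasive, no histology) are excluded.
```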
The datasets differ in how the images were collected and whether additional automatically detected frames are included.
| Dataset | Description | #Polyps |
|---|---|---|
| D1 | Manual selection from full withdrawal videos | 358 |
| D2 | D1 + automatically localized images (training only) | 358 |
| D3 | Manual selection of large/close polyps from new videos focused on the classification task | 129 |
| D4 | Combination D1 + D3 | 487 |
| D5 | Combination D2 + D3 | 487 |
Note: The dataset names D1, D2, D3, D4, and D5 are labels used in the thesis for clarity and do not correspond to actual folders or dataset identifiers in this repository.
To reproduce the experiments, you must prepare your own dataset folders inside experiments/datasets/ and reference them through the dataset_name parameter in pipeline.params.
All models are pretrained on ImageNet (a loading sketch follows the list):
- ResNet50
- VGG19
- InceptionV3
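As a reference, a minimal GluonCV transfer-learning setup might look as follows. This is a sketch following the GluonCV model-zoo API, not the pipeline's actual training code; the model variant `resnet50_v2` is an assumption:

```python
# Illustrative GluonCV setup: load an ImageNet-pretrained backbone and
# replace its classifier with a 2-class output (adenoma vs. hyperplastic).
import mxnet as mx
from mxnet.gluon import nn
from gluoncv.model_zoo import get_model

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

net = get_model("resnet50_v2", pretrained=True, ctx=ctx)
with net.name_scope():
    net.output = nn.Dense(2)
net.output.initialize(mx.init.Xavier(), ctx=ctx)
```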
The training process consists of:
- 150 epochs
- Optional oversampling of minority classes in the training partition
- 5-fold stratified cross-validation at the polyp level (images from the same polyp cannot appear in different partitions; see the sketch below)
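A minimal sketch of such a polyp-level split with optional oversampling, using scikit-learn's `StratifiedGroupKFold` as a stand-in for the pipeline's own partitioning logic (the `condition` column is assumed to already hold the two training classes):

```python
# Illustrative polyp-level stratified 5-fold split with naive oversampling.
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

metadata = pd.read_csv("metadata.csv")
labels = metadata["condition"]
groups = metadata["polyp_id"]  # all images of a polyp stay together

skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(metadata, labels, groups), start=1):
    train = metadata.iloc[train_idx]
    # Naive oversampling: duplicate minority-class rows until classes match.
    counts = train["condition"].value_counts()
    minority = counts.idxmin()
    extra = train[train["condition"] == minority].sample(
        counts.max() - counts.min(), replace=True, random_state=42
    )
    balanced_train = pd.concat([train, extra])
    print(f"fold {fold}: {len(balanced_train)} train / {len(val_idx)} val images")
```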
Requirements:
- Docker
- Access to PIBAdb datasets (via request to the PIBAdb Cohort)
The pipeline runs inside a Docker container to ensure reproducibility. To build the Docker image:
```
docker build -t polydeep/classification .
```

It is recommended to organize datasets and experiments as follows:
```
/classification-project
│
├── experiments/
│   ├── datasets/
│   │   └── cropped_polyps_dataset/
│   │       ├── images/
│   │       ├── metadata.csv
│   │       └── polyp-metadata.csv
│   ├── Exp_K1_CNN/
│   │   ├── pipeline.params
│   │   └── ...
│   ├── Exp_K2_CNN/
│   │   ├── pipeline.params
│   │   └── ...
│   └── ...
└── ...
```

Note: The folder experiments/datasets/ contains all datasets used by the different experiments.
Note: Each experiment folder (Exp_K1_CNN, Exp_K2_CNN, etc.) contains a pipeline.params file specifying its configuration. Example:
```
dataset_name=cropped_polyps_dataset
model_name=vgg19
num_gpus=1
epochs=150
kfolds=5
fold=1
with_balanced_train
batch_size=96
learning_rate=0.00001
```

To run an experiment:
```
docker run --rm \
  -v /local/path/to/experiment:/experiment \
  polydeep/classification \
  -p /pipeline.xml \
  --params /experiment/pipeline.params
```

Where:
- `-v /local/path/to/experiment:/experiment` mounts the local experiment folder inside the container.
- `-p /pipeline.xml` specifies the pipeline to run (already included in the container).
- `--params /experiment/pipeline.params` provides the experiment configuration.
| Flag | Description |
|---|---|
| with_balanced_train | Enables oversampling in the training set |
| with_localized_images | Adds additional automatically detected images of the same polyp (localized images) from another dataset exclusively to the training set |
| with_merge_ttv | Merges train/val partitions from two prior experiments |
The experiments described in the thesis fall into three categories, depending on dataset composition and training strategy. This section describes how to reproduce each one from this repository using the Compi pipeline.
Applies to:
- D1 – PIBAdb Manually Selected NBI
- D3 – PIBAdb Manually Selected for Classification NBI
In the experiment folder (e.g., Exp_K1_ResNet50/), the pipeline.params file must specify the dataset name. For example, if the dataset folder is named D1 or D3:
```
dataset_name = D1   # or D3
```

Run with:
```
docker run --rm \
  -v /local/path/to/experiment:/experiment \
  polydeep/classification \
  -p /pipeline.xml \
  --params /experiment/pipeline.params
```

These experiments include additional images of the same polyps, obtained from a secondary dataset (e.g., “PIBAdb Automatically Selected NBI”), and add them only to the training split (a sketch of this train-only merge follows the list below).
Applicable to:
- D2 – D1 + localized images (training only)
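Conceptually, the localized images are appended to the training partition only. A minimal pandas sketch of the idea, with all file paths assumed (the pipeline performs this step internally when `with_localized_images` is set):

```python
# Illustrative train-only merge of localized images; not the pipeline's code.
import pandas as pd

# Base dataset metadata, already split for the current fold (assumed files).
train = pd.read_csv("ttv/train/metadata.csv")
val = pd.read_csv("ttv/validation/metadata.csv")

localized = pd.read_csv("datasets-localized/metadata.csv")

# Keep only localized frames whose polyp is already in this fold's train split,
# so no polyp leaks from train into validation.
localized = localized[localized["polyp_id"].isin(train["polyp_id"])]

train_augmented = pd.concat([train, localized], ignore_index=True)
# `val` stays untouched: localized images are never used for evaluation.
```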
```
dataset_name = D2
with_localized_images
```

Run with:
```
docker run --rm \
  -v /local/path/to/dataset:/datasets-localized \
  -v /local/path/to/experiment:/experiment \
  polydeep/classification \
  -p /pipeline.xml \
  --params /experiment/pipeline.params
```

These experiments require merging partitions from two distinct preliminary experiments.
Steps:
- Create two baseline experiments. For example:
  - Exp_K1_D1_ResNet50
  - Exp_K1_D3_ResNet50
- Enable `with_merge_ttv` to merge the `train` and `val` partitions before training (a conceptual sketch follows the example parameters below).
```
exp1=Exp_K1_D1_ResNet50
exp2=Exp_K1_D3_ResNet50
model_name=resnet50
epochs=150
kfolds=5
fold=1
with_merge_ttv
```
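Conceptually, the merge concatenates the corresponding partitions of both experiments. A rough sketch under the assumption that each prior experiment keeps its partitions as `ttv/<split>/metadata.csv` (the file name is an assumption; the pipeline performs the merge itself from the `exp1`/`exp2` parameters):

```python
# Illustrative sketch of what with_merge_ttv does conceptually.
import pandas as pd
from pathlib import Path

def load_split(experiment: str, split: str) -> pd.DataFrame:
    # Assumes each prior experiment stores its partitions under ttv/<split>/.
    return pd.read_csv(Path(experiment) / "ttv" / split / "metadata.csv")

for split in ("train", "validation"):
    merged = pd.concat(
        [load_split("Exp_K1_D1_ResNet50", split),
         load_split("Exp_K1_D3_ResNet50", split)],
        ignore_index=True,
    )
    out = Path("ttv") / split
    out.mkdir(parents=True, exist_ok=True)
    merged.to_csv(out / "metadata.csv", index=False)
```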
After execution, the experiment directory contains (a loading snippet is sketched after the list):

- `results/`: Training and validation metrics (CSV format)
- `summaries/`: Performance and loss plots (PNG format)
- `index.html`: Summary report with all metrics and plots
- `ttv/`: `train/` and `validation/` folders with metadata and image list
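The metric CSVs can then be inspected directly, for example (the file names inside `results/` depend on the pipeline and are not assumed here):

```python
# Illustrative: concatenate all metric CSVs produced by an experiment.
import pandas as pd
from pathlib import Path

frames = [pd.read_csv(f) for f in sorted(Path("results").glob("*.csv"))]
metrics = pd.concat(frames, ignore_index=True)
print(metrics.head())
```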
Note: Results are written to the mounted experiment folder inside the container, so they are available in the local experiment folder once the run finishes.