feat: Added evaluation and readme

Karol-G · Karol-G · commit 8c32fcb73678 · 2025-06-27T18:35:23.000+02:00
diff --git a/README.md b/README.md
@@ -1,49 +1,124 @@
 # ScribbleBench
 
 [![License Apache Software License 2.0](https://img.shields.io/pypi/l/ScribbleBench.svg?color=green)](https://github.com/Karol-G/ScribbleBench/raw/main/LICENSE)
-[![PyPI](https://img.shields.io/pypi/v/ScribbleBench.svg?color=green)](https://pypi.org/project/ScribbleBench)
 [![Python Version](https://img.shields.io/pypi/pyversions/ScribbleBench.svg?color=green)](https://python.org)
-[![tests](https://github.com/Karol-G/ScribbleBench/workflows/tests/badge.svg)](https://github.com/Karol-G/ScribbleBench/actions)
-![Unit Tests](https://github.com/Karol-G/ScribbleBench/actions/workflows/test_and_deploy.yml/badge.svg?branch=main)
-[![codecov](https://codecov.io/gh/Karol-G/ScribbleBench/branch/main/graph/badge.svg)](https://codecov.io/gh/Karol-G/ScribbleBench)
 
-Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation
+**ScribbleBench** is a comprehensive benchmark for evaluating the generalization capabilities of 3D scribble-supervised medical image segmentation methods. It spans seven diverse datasets across multiple anatomies and modalities and provides realistic, automatically generated scribble annotations.
 
-----------------------------------
+This repository provides:
+- A guide on how to set up the ScribbleBench benchmark using the original dataset sources and our ScribbleBench scribbles.
+- Our heuristic scribble generation code for creating realistic interior and boundary scribbles.
+- An evaluation script for benchmarking your method on ScribbleBench.
+- A reference to our scribble baseline nnUNet+pL.
+- A scribble annotation protocol for domain experts that can guide the quick manual annotation of new datasets.
 
-Project description...
+ScribbleBench was introduced in our MICCAI 2025 paper:  
+**“Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation”**  
+Authors: Karol Gotkowski, Klaus H. Maier-Hein, Fabian Isensee
 
-## Installation
 
-You can install `ScribbleBench` via [pip](https://pypi.org/project/ScribbleBench/):
+## 📦 Benchmark Setup
 
-    pip install ScribbleBench
+ScribbleBench includes scribbles for the following 7 public datasets:
+- ACDC
+- MSCMR
+- WORD
+- AMOS2022 (Task2)
+- KiTS23
+- LiTS
+- BraTS2020
 
+### 📥 Download Datasets
 
+TODO
 
 
-## Contributing
+## 🛠️ Scribble Generation
 
-Contributions are very welcome. Tests can be run with [tox], please ensure
-the coverage at least stays the same before you submit a pull request.
+You can use our script to generate scribbles for your own 3D medical segmentation datasets. The script supports:
+- Interior scribbles using NURBS curves.
+- Boundary scribbles based on partial contours.
+- Foreground/background slice balancing.
+- Multiprocessing to speed up the processing of large datasets.
 
-## License
+### 🚀 Run Scribble Generation
 
-Distributed under the terms of the [Apache Software License 2.0] license,
-"ScribbleBench" is free and open source software
+```bash
+python generate_scribbles.py \
+  --input path/to/dense_segmentations \
+  --output path/to/save_scribbles \
+  --num_labels 4 \
+  --conf scribble_conf.yml \
+  --processes 8
+```
 
-## Issues
+**Optional arguments:**
 
-If you encounter any problems, please file an issue along with a detailed description.
+* `--name` → specify one or more file names to process (omit `.nii.gz`)
+* `--disable_ignore` → disables marking unlabeled voxels with an ignore label
 
-[Cookiecutter]: https://github.com/audreyr/cookiecutter
-[MIT]: http://opensource.org/licenses/MIT
-[BSD-3]: http://opensource.org/licenses/BSD-3-Clause
-[GNU GPL v3.0]: http://www.gnu.org/licenses/gpl-3.0.txt
-[GNU LGPL v3.0]: http://www.gnu.org/licenses/lgpl-3.0.txt
-[Apache Software License 2.0]: http://www.apache.org/licenses/LICENSE-2.0
-[Mozilla Public License 2.0]: https://www.mozilla.org/media/MPL/2.0/index.txt
+## 📊 Evaluation
 
-[tox]: https://tox.readthedocs.io/en/latest/
-[pip]: https://pypi.org/project/pip/
-[PyPI]: https://pypi.org/
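+You can evaluate your segmentation predictions against the dense ground truth using the provided script. Ground truth and predictions must be `.nii.gz` files with matching names, and the script reports the mean foreground Dice score over all cases: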
+
+```bash
+python scribblebench/evaluation.py \
+  --gt_dir path/to/ground_truth \
+  --pred_dir path/to/predictions \
+  --num_labels 4 \
+  --processes 8
+```
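+
+For each case, the script computes the Dice score of every foreground label, skips labels that are absent from both ground truth and prediction, and averages the remaining scores; the reported value is the mean over all cases. The NumPy sketch below mirrors this per-case computation on toy arrays. `foreground_dice` is a hypothetical helper written for illustration, not a function exported by ScribbleBench.
+
+```python
+import numpy as np
+
+
+def foreground_dice(gt: np.ndarray, pred: np.ndarray, num_labels: int) -> float:
+    """Mean Dice over foreground labels 1..num_labels-1 (hypothetical helper)."""
+    scores = []
+    for label in range(1, num_labels):  # label 0 (background) is skipped
+        ref, seg = gt == label, pred == label
+        tp = np.sum(ref & seg)
+        fp = np.sum(~ref & seg)
+        fn = np.sum(ref & ~seg)
+        denom = 2 * tp + fp + fn
+        # A label absent from both volumes has undefined Dice and is left out of the mean.
+        scores.append(2 * tp / denom if denom > 0 else np.nan)
+    return float(np.nanmean(scores))
+
+
+gt = np.array([0, 1, 1, 2, 2, 0])
+pred = np.array([0, 1, 0, 2, 2, 2])
+print(foreground_dice(gt, pred, num_labels=4))
+```
+
+Here label 1 scores 2/3 and label 2 scores 0.8, while label 3 occurs in neither array and is excluded instead of being counted as 0 or 1, so the case score is about 0.733.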
+
+## 🛠️ Scribble Baseline nnUNet+pL
+
+Our scribble baseline nnUNet+pL is implemented directly in the [nnU-Net](https://github.com/MIC-DKFZ/nnUNet) framework, where it is referred to as the "ignore label". It is described in the [nnU-Net ignore label documentation](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/ignore_label.md).
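+
+With the ignore label, voxels that carry no scribble do not contribute to the training loss. The NumPy sketch below illustrates that idea with a cross-entropy restricted to labeled voxels. It is not nnU-Net's implementation; `partial_cross_entropy` and the ignore value used here are hypothetical.
+
+```python
+import numpy as np
+
+
+def partial_cross_entropy(logits: np.ndarray, target: np.ndarray, ignore_value: int) -> float:
+    """Mean cross-entropy over voxels whose target is not `ignore_value`.
+
+    logits: (num_classes, N) raw scores; target: (N,) integer labels.
+    """
+    labeled = target != ignore_value
+    # Numerically stable log-softmax over the class axis.
+    shifted = logits - logits.max(axis=0, keepdims=True)
+    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
+    # Pick the log-probability of the target class, for labeled voxels only.
+    picked = log_probs[target[labeled], np.nonzero(labeled)[0]]
+    return float(-picked.mean())
+
+
+logits = np.array([[2.0, 0.0, 5.0], [0.0, 2.0, -5.0]])  # 2 classes, 3 voxels
+target = np.array([0, 1, 2])  # the third voxel carries the (hypothetical) ignore value 2
+print(partial_cross_entropy(logits, target, ignore_value=2))
+```
+
+The unlabeled third voxel has no influence on the result, which equals log(1 + e^-2) ≈ 0.127 from the two labeled voxels alone.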
+
+## 📋 Scribble Annotation Protocol
+
+You can also create scribbles for new datasets manually by following this lightweight annotation protocol. Such human-drawn scribbles can be used to train a model with the same methods as the automatically generated ones.
+
+### ✏️ Instructions
+
+Given a 3D image **I** in your dataset:
+- For each axial slice **S** in **I**:
+  - For each class **C** present in slice **S**:
+    - Select a single **connected component (CC)** of class **C** in **S**
+    - For that component **CC**, draw:
+      - One **interior scribble**
+      - One **boundary scribble**
+
+Note: Do not skip the background class! Also annotate a sufficient number of slices that contain only background.
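+
+The per-slice loop above can be prototyped to list the candidate components an annotator chooses from. The sketch below assumes that the first array axis is the axial axis and uses 4-connectivity within a slice; `connected_components_2d` and `candidate_components` are hypothetical helpers, and the protocol itself leaves the choice of component to the annotator.
+
+```python
+from collections import deque
+
+import numpy as np
+
+
+def connected_components_2d(mask: np.ndarray) -> list:
+    """Return the 4-connected components of a 2D boolean mask as lists of (y, x)."""
+    seen = np.zeros_like(mask, dtype=bool)
+    components = []
+    for start in zip(*np.nonzero(mask)):
+        if seen[start]:
+            continue
+        seen[start] = True
+        queue, component = deque([start]), []
+        while queue:  # breadth-first flood fill from `start`
+            y, x = queue.popleft()
+            component.append((int(y), int(x)))
+            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
+                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
+                        and mask[ny, nx] and not seen[ny, nx]):
+                    seen[ny, nx] = True
+                    queue.append((ny, nx))
+        components.append(component)
+    return components
+
+
+def candidate_components(seg: np.ndarray) -> dict:
+    """Map (slice index, class) to the components of that class in that slice."""
+    candidates = {}
+    for z in range(seg.shape[0]):  # assumption: axis 0 is the axial axis
+        for c in np.unique(seg[z]):  # includes the background class 0
+            candidates[(z, int(c))] = connected_components_2d(seg[z] == c)
+    return candidates
+
+
+seg = np.zeros((1, 4, 5), dtype=np.uint8)
+seg[0, 0:2, 0:2] = 1  # a 2x2 blob of class 1
+seg[0, 3, 3:5] = 1    # a separate 1x2 blob of class 1
+print(len(candidate_components(seg)[(0, 1)]))  # → 2
+```
+
+For each key, the annotator would then pick a single component and draw one interior and one boundary scribble on it.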
+
+#### 🟢 Interior Scribble
+- Must be drawn **inside the component CC**.
+- Should be placed roughly **in and around the center area** of the component.
+- Ideal length is **comparable to the diameter or extent** of the component.
+- Can have any shape (straight, curved, etc.) as long as it lies **fully within the component**.
+
+#### 🔵 Boundary Scribble
+- Should trace **a portion (15%–100%)** of the **inner boundary** of the component CC.
+- Should ideally follow the actual boundary as closely as possible.
+- A **1–3 voxel inward offset** is acceptable, but **closer to the true boundary is better**.
+- This scribble helps the model capture **boundary details** during learning.
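+
+The inner boundary that a boundary scribble traces can be made concrete as the component voxels with at least one 4-neighbor outside the component. The NumPy sketch below computes it for a 2D boolean mask of a single component; `inner_boundary` is a hypothetical helper, and choosing which 15%–100% portion of it to trace remains up to the annotator.
+
+```python
+import numpy as np
+
+
+def inner_boundary(component: np.ndarray) -> np.ndarray:
+    """Voxels of a 2D boolean mask that have at least one 4-neighbor outside it."""
+    padded = np.pad(component, 1, constant_values=False)
+    # A voxel is interior only if its up, down, left and right neighbors are all set.
+    interior = (
+        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
+    )
+    return component & ~interior
+
+
+mask = np.zeros((5, 5), dtype=bool)
+mask[1:4, 1:4] = True  # a 3x3 square component
+print(inner_boundary(mask).astype(int))
+```
+
+For the 3x3 square, the result is the ring of 8 outer voxels; only the center voxel is interior.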
+
+Following this protocol allows quick and efficient labeling of 3D datasets using just a few sparse lines per class and slice, while maintaining strong training performance.
+
+
+
+## 📄 Citation
+
+If you use ScribbleBench or our scribble generation code, please cite:
+
+```bibtex
+@inproceedings{gotkowski2025scribblebench,
+  title     = {Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation},
+  author    = {Karol Gotkowski and Klaus H. Maier-Hein and Fabian Isensee},
+  booktitle = {International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
+  year      = {2025}
+}
+```
+
+
+## 📬 Contact
+
+For questions, suggestions, or contributions, feel free to open an issue or contact [karol.gotkowski@dkfz.de](mailto:karol.gotkowski@dkfz.de).
diff --git a/scribblebench/evaluation.py b/scribblebench/evaluation.py
@@ -0,0 +1,84 @@
+import numpy as np
+import os
+from tqdmp import tqdmp
+from pathlib import Path
+import argparse
+from medvol import MedVol
+
+
+def evaluate(gt_dir, pred_dir, num_classes, num_processes=None):
+    gt_dir = Path(gt_dir)
+    pred_dir = Path(pred_dir)
+    names_gt = [path.name[:-7] for path in gt_dir.glob("*.nii.gz")]
+    names_pred = [path.name[:-7] for path in pred_dir.glob("*.nii.gz")]
+
+    if set(names_gt) != set(names_pred):
+        raise RuntimeError(f"GT and prediction file sets differ. Missing predictions: {sorted(set(names_gt) - set(names_pred))}, unmatched predictions: {sorted(set(names_pred) - set(names_gt))}")
+
+    if isinstance(num_processes, str):
+        num_processes = int(num_processes)
+
+    dice_scores = tqdmp(evaluate_prediction, names_gt, num_processes, gt_dir=gt_dir, pred_dir=pred_dir, num_classes=num_classes, desc="Evaluating")
+
+    mean_dice_score = float(np.mean(dice_scores))
+
+    print(f"Mean Dice Score: {mean_dice_score}")
+    return mean_dice_score
+
+def evaluate_prediction(name, gt_dir, pred_dir, num_classes, foreground_only=True):
+    gt_filepath = gt_dir / f"{name}.nii.gz"
+    pred_filepath = pred_dir / f"{name}.nii.gz"
+    if not os.path.exists(pred_filepath):
+        raise RuntimeError(f"Prediction ({name}) does not exist.")
+    gt = MedVol(str(gt_filepath)).array
+    pred = MedVol(str(pred_filepath)).array
+    gt = np.rint(np.asarray(gt)).astype(np.uint8)
+    pred = np.rint(np.asarray(pred)).astype(np.uint8)
+    if gt.shape != pred.shape:
+        raise RuntimeError("Prediction and GT do not have the same shape.")
+    gt = gt.flatten()
+    pred = pred.flatten()
+    dice_score = comp_dice(pred, gt, num_classes, foreground_only)
+    return dice_score
+
+
+def comp_dice(pred, gt, num_classes, foreground_only=True, ignore_mask=None):
+    class_labels = list(range(num_classes))
+    if foreground_only:
+        class_labels = class_labels[1:]
+
+    dice_score = []
+    for label in class_labels:
+        tp, fp, fn, tn = compute_tp_fp_fn_tn(gt == label, pred == label, ignore_mask)
+        if tp + fp + fn != 0:
+            class_dice_score = float(2 * tp / (2 * tp + fp + fn))
+        else:
+            class_dice_score = np.nan
+        dice_score.append(class_dice_score)
+
+    dice_score = np.nanmean(dice_score)
+    dice_score = float(dice_score)
+    return dice_score
+
+
+def compute_tp_fp_fn_tn(mask_ref: np.ndarray, mask_pred: np.ndarray, ignore_mask: np.ndarray = None):
+    if ignore_mask is None:
+        use_mask = np.ones_like(mask_ref, dtype=bool)
+    else:
+        use_mask = ~ignore_mask
+    tp = np.sum((mask_ref & mask_pred) & use_mask)
+    fp = np.sum(((~mask_ref) & mask_pred) & use_mask)
+    fn = np.sum((mask_ref & (~mask_pred)) & use_mask)
+    tn = np.sum(((~mask_ref) & (~mask_pred)) & use_mask)
+    return tp, fp, fn, tn
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-gt', "--gt_dir", required=True, help="Path to the dense GT segmentations folder.")
+    parser.add_argument('-pred', "--pred_dir", required=True, help="Path to the dense prediction segmentation folder.")
+    parser.add_argument('-l', "--num_labels", required=True, type=int, help="The number of segmentation labels.")
+    parser.add_argument('-p', "--processes", required=False, default=None, help="Number of multiprocessing processes.")
+    args = parser.parse_args()
+
+    evaluate(args.gt_dir, args.pred_dir, args.num_labels, args.processes)
diff --git a/scribblebench/scribble_generation.py b/scribblebench/scribble_generation.py
@@ -33,7 +33,7 @@ def generate_scribble_dataset(load_dir, save_dir, num_labels, conf_filepath, num
         num_processes = int(num_processes)    
 
     if names is None:
-        names = [path.name[:-7] for path in Path(load_dir).rglob("*.nii.gz")]
+        names = [path.name[:-7] for path in load_dir.rglob("*.nii.gz")]
     elif isinstance(names, str):
         names = [names]