# GroundedSurg

Official repository for GroundedSurg, a grounding-based surgical vision benchmark introduced at MICCAI.
GroundedSurg reformulates surgical instrument perception as a language-conditioned, instance-level segmentation task.
Unlike conventional category-level segmentation benchmarks, GroundedSurg requires models to:
- Resolve natural-language references
- Disambiguate between multiple similar instruments
- Perform structured spatial grounding
- Produce precise pixel-level segmentation masks
## ✨ Key Features
- First language-conditioned surgical grounding benchmark
- Instance-level disambiguation across multi-instrument scenes
- Structured spatial grounding:
  - Bounding box
  - Center point
  - Pixel-level mask
- Multi-procedure diversity
- Unified evaluation protocol for Vision-Language Models
## 📊 Dataset Statistics

| Statistic | Value |
|---|---|
| Number of images | ~612 |
| Tool annotations | ~1,071 |
| Average tools per image | ~1.6 |
| Surgical procedures | 4 |
| Annotation type | Pixel-level segmentation |
| Spatial grounding | Bounding box + Center point |
| Language descriptions | Instance-level |
### Surgical Procedures
- Ophthalmic Surgery
- Laparoscopic Cholecystectomy
- Robotic Nephrectomy
- Gastrectomy
## 🎯 Task Definition
Each benchmark instance consists of:
- Surgical image I
- Natural-language query T
- Bounding box B
- Center point C
- Ground-truth segmentation mask M

Objective: models must localize and segment the instrument described by the query.
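As a small illustration of the spatial targets, one common convention derives the center point C from the bounding box B. Whether GroundedSurg's annotated center is the box center or the mask centroid is an annotation detail not stated here, so treat this as an assumption:

```python
def box_center(x1: float, y1: float, x2: float, y2: float) -> tuple[float, float]:
    """Center of an axis-aligned bounding box (x1, y1, x2, y2).

    Assumption: the benchmark's center point C may instead be the
    mask centroid; this is only an illustrative convention.
    """
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
```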
## 📏 Evaluation Metrics
- IoU
- IoU@0.5 / IoU@0.9
- mIoU
- Dice
- Bounding Box IoU
- Normalized Distance Error (NDE)
All metrics are computed per image-query pair.
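The mask metrics above can be sketched as follows. Normalizing NDE by the image diagonal is an assumption here, since normalization conventions vary:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return float(2.0 * inter / total) if total > 0 else 0.0

def normalized_distance_error(pred_pt, gt_pt, width: int, height: int) -> float:
    """Euclidean distance between predicted and ground-truth center
    points, normalized by the image diagonal (assumed convention)."""
    d = np.hypot(pred_pt[0] - gt_pt[0], pred_pt[1] - gt_pt[1])
    return float(d / np.hypot(width, height))
```

Thresholded scores such as IoU@0.5 then count the fraction of image-query pairs whose IoU exceeds the threshold, and mIoU averages the per-pair IoU.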
## 🔁 Evaluation Protocol
GroundedSurg follows a unified language-conditioned segmentation protocol:
1. The Vision-Language Model predicts a bounding box and a center point.
2. The predictions are projected to a segmentation backend (SAM2 / SAM3).
3. The final mask is evaluated against the ground truth.
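The first stage of this protocol can be sketched as parsing the VLM's textual prediction into a box and point clamped to the image bounds, which are then passed as prompts to the segmentation backend. The JSON field names below are illustrative assumptions, not the repo's actual output schema:

```python
import json

def parse_prediction(text: str, width: int, height: int):
    """Parse a VLM response of the (assumed) form
    '{"bbox": [x1, y1, x2, y2], "point": [x, y]}' and clamp all
    coordinates to the image bounds before prompting SAM2/SAM3."""
    pred = json.loads(text)
    x1, y1, x2, y2 = pred["bbox"]
    box = [max(0, min(x1, width)), max(0, min(y1, height)),
           max(0, min(x2, width)), max(0, min(y2, height))]
    px, py = pred["point"]
    point = [max(0, min(px, width)), max(0, min(py, height))]
    return box, point
```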
## 🤖 Benchmarked Models
The following models were benchmarked on GroundedSurg:
- Qwen2.5-VL
- Qwen3-VL
- Gemma 3 (12B / 27B)
- LLaMA 3 Vision
- DeepSeek-VL2
- Mistral 3
- VisionReasoner
- Migician
- InternVL
- MedMO
- MedGemma
- MedVLM-R1
- BiMediX2
- GPT-4o-mini
- GPT-5.2
## 📁 Repository Structure

```
GroundedSurg/
├── Model_Evaluation_Scripts/
│   ├── gemma3.py
│   ├── llama.py
│   ├── qwen_2_5.py
│   ├── qwen_3.py
│   ├── mistral_3.py
│   ├── intern_eval.py
│   ├── med_mo.py
│   ├── migician.py
│   ├── MedGemma/
│   ├── MedVLM-R1/
│   └── Segmentation/
│       ├── sam2.py
│       ├── sam3.py
│       └── mask_eval.sh
└── Prompt/
    ├── prompt1.txt
    └── prompt2.txt
```

Example:

```
python Model_Evaluation_Scripts/qwen_2_5.py
```

For the segmentation backend:

```
python Model_Evaluation_Scripts/Segmentation/sam3.py
```

---

## 📦 Installation
⚠️ Note: Different Vision-Language Models may require separate environments depending on their official repositories.
We recommend creating dedicated environments per model when necessary.
```
conda create -n groundedsurg python=3.10
conda activate groundedsurg
```

Install PyTorch (adjust the CUDA version if needed):

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Install additional dependencies:

```
pip install transformers accelerate
pip install opencv-python
pip install numpy scipy tqdm
pip install pillow matplotlib
pip install scikit-image
```

GroundedSurg uses a frozen SAM-based segmentation backend.
Clone the official SAM3 repository:

```
git clone https://github.com/facebookresearch/sam3
```

Follow the installation instructions from the official repository.
Download the pretrained SAM3 checkpoint from the official repository.
After downloading, update the checkpoint path inside `Model_Evaluation_Scripts/Segmentation/sam3.py`.
GroundedSurg has been tested with:
- Python 3.10
- PyTorch 2.x
- CUDA 12.x
- Ubuntu 22.04
## 📖 Citation
If you use GroundedSurg, please cite:
```bibtex
@inproceedings{groundedsurg2026,
  title={GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation},
  author={Tajamul Ashraf and Abrar Ul Riyaz and Wasif Tak and Tavaheed Tariq and Sonia Yadav and Moloud Abdar and Janibul Bashir},
  booktitle={MICCAI},
  year={2026}
}
```

(To be added)
## 📬 Contact
Abrar Ul Riyaz (https://abrarulriyaz.vercel.app)
Gaash Research Lab, NIT Srinagar

