---
blurb: "The goal is to use Visual Question Answering (VQA) to interpret and answer questions based on gastrointestinal images, aiming to enhance decision support and improve AI-driven medical decision-making. We provide a gastrointestinal dataset containing images and videos with VQA labels and additional metadata."
---
<!-- # please respect the structure below-->

*See the [MediaEval 2025 webpage](https://multimediaeval.github.io/editions/2025/) for information on how to register and participate.*

*See our [GitHub Repository](https://github.com/simula/MediaEval-Medico-2025) for the latest information about the task. We encourage participants to check the repository regularly for updates.*

#### Task description

Gastrointestinal (GI) diseases are among the most common and critical health concerns worldwide, with conditions like colorectal cancer (CRC) requiring early diagnosis and intervention. AI-driven decision support systems have shown potential in assisting clinicians with diagnosis, but a major challenge remains: explainability. While deep learning models can achieve high diagnostic accuracy, their "black-box" nature limits their adoption in clinical practice, where trust and interpretability are essential. After successfully organizing multiple Medico challenges at MediaEval in previous years, we propose a new task for Medico 2025: **Visual Question Answering (with multimodal explanations) for Gastrointestinal Imaging**.

Medical Visual Question Answering (VQA) is a rapidly growing research area that combines computer vision and natural language processing to answer clinically relevant questions based on medical images. However, existing VQA models often lack transparency, making it difficult for healthcare professionals to assess the reliability of AI-generated answers. To address this, the Medico 2025 challenge will focus on explainable VQA for GI imaging, encouraging participants to develop models that provide not only accurate answers but also clear justifications aligned with clinical reasoning.

This challenge will offer a benchmark dataset containing GI images, videos, and associated VQA annotations, allowing for rigorous evaluation of AI models. By integrating multimodal data and explainability metrics, we aim to advance research in interpretable AI and improve the potential for clinical adoption.

We define two main subtasks for this year's challenge. Subtask 2 builds on Subtask 1, meaning Subtask 1 must be completed in order to participate in Subtask 2.

**Subtask 1: AI Performance on Medical Image Question Answering**

This subtask challenges participants to develop AI models that can accurately interpret and respond to clinical questions based on GI images from the **Kvasir-VQA-x1** dataset, which contains 159,549 question–answer pairs from 6,500 original GI images, with additional weakly augmented images and complexity-level annotations. Questions fall into six main categories: Yes/No, Single-Choice, Multiple-Choice, Color-Related, Location-Related, and Numerical Count, as well as merged reasoning-based questions. Performance will be assessed using metrics such as BLEU, ROUGE (1/2/L), and METEOR, alongside medical correctness and relevance.
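
For orientation, the snippet below is a minimal scoring sketch using the Hugging Face `evaluate` package; the example answers are invented, and the official scoring scripts released in the task repository may compute these metrics differently.

```python
# Minimal scoring sketch (not the official evaluation script): compares a
# model's free-text answers against reference answers with BLEU, ROUGE-L and
# METEOR, assuming the Hugging Face `evaluate` package is installed.
import evaluate

predictions = ["there is a single polyp in the sigmoid colon"]   # invented example
references = ["one polyp is visible in the sigmoid colon"]       # invented example

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)

print(f"BLEU:    {bleu['bleu']:.3f}")
print(f"ROUGE-L: {rouge['rougeL']:.3f}")
print(f"METEOR:  {meteor['meteor']:.3f}")
```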

**Subtask 2: Clinician-Oriented Multimodal Explanations in GI**

This subtask extends Subtask 1 by focusing on the interpretability of model outputs for clinical decision-making. Models must not only generate accurate answers but also provide clear, multimodal explanations that enhance clinician trust and usability. Explanations must be faithful to the model’s reasoning, clinically relevant, and useful for real-world decision-making. Participants are encouraged to combine textual clinical reasoning with visual localization (e.g., heatmaps, segmentation masks, bounding boxes) and/or confidence measures. Performance will be assessed based on answer correctness, explanation clarity, visual alignment, confidence calibration, and medical relevance, as rated by expert reviewers.
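
As a rough illustration of what such a multimodal explanation could look like (not a required format), the sketch below bundles a textual rationale, a confidence score, and a heatmap overlay; the relevance map is random here and would normally come from the model, for example Grad-CAM or cross-attention weights.

```python
# Illustrative sketch only: packaging an answer with a textual rationale,
# a confidence score, and a saliency heatmap overlay. The relevance map is
# random; in practice it would come from the model. Field names are
# placeholders, not a required submission format.
import numpy as np
from PIL import Image

def overlay_heatmap(image: Image.Image, relevance: np.ndarray, alpha: float = 0.45) -> Image.Image:
    """Blend a [0, 1] relevance map (any resolution) onto the endoscopy frame."""
    heat = Image.fromarray((relevance * 255).astype("uint8")).resize(image.size, Image.BILINEAR)
    zero = Image.new("L", image.size, 0)
    heat_rgb = Image.merge("RGB", (heat, zero, zero))   # red channel carries relevance
    return Image.blend(image.convert("RGB"), heat_rgb, alpha)

frame = Image.new("RGB", (512, 512), "gray")            # stand-in for a GI frame
relevance = np.random.rand(16, 16)                      # stand-in for model saliency

explanation = {
    "answer": "polyp",
    "rationale": "A raised lesion with irregular borders is visible in the "
                 "highlighted region, consistent with a polyp.",
    "confidence": 0.87,
    "visual_evidence": overlay_heatmap(frame, relevance),
}
explanation["visual_evidence"].save("explanation_overlay.png")
```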

#### Motivation and background

This challenge builds upon previous work in medical VQA, where AI models answer clinically relevant questions based on medical images.

#### Target group

We invite participation from multiple communities, including computer vision, natural language processing, multimedia analysis, medical imaging, and human–AI interaction. We strongly believe that many multimedia researchers can contribute to this medical scenario, and we hope that many people will be personally motivated to take on the challenge and try out their ideas. To ensure that young researchers succeed, we will also provide mentoring for students at both undergraduate and graduate levels.

#### Data

The dataset for Medico 2025, **Kvasir-VQA-x1** \[1, 2\], is a large-scale text–image pair gastrointestinal (GI) dataset built upon the HyperKvasir and Kvasir-Instrument datasets, now enhanced with 159,549 naturalized question–answer annotations, complexity-level scores for curriculum training, and weak augmentations (10 per original image). It is specifically designed to support Visual Question Answering (VQA) and other multimodal AI applications in GI diagnostics.

The dataset is available here: [https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1](https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1)
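
For a quick start, here is a minimal loading sketch with the Hugging Face `datasets` library; the split name and column names (`image`, `question`, `answer`) are assumptions, so please consult the dataset card for the actual schema.

```python
# Minimal loading sketch; the split and column names are assumptions, so check
# the dataset card at huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1 first.
from datasets import load_dataset

ds = load_dataset("SimulaMet/Kvasir-VQA-x1", split="train")
print(ds)                                    # row count and available columns
sample = ds[0]
print(sample["question"], "->", sample["answer"])
sample["image"].save("example_frame.png")    # image columns decode to PIL objects
```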

#### Evaluation methodology

**Subtask 1: VQA Performance**

* Metrics: BLEU, ROUGE (1/2/L), METEOR
* Settings: Original & augmented images
* Criteria: Accuracy, relevance, and medical correctness

**Subtask 2: Explainability**

* Metrics: All Subtask 1 metrics
* Expert-rated on:
  * Answer correctness
  * Clarity & clinical relevance
  * Visual alignment
  * Confidence calibration (see the self-check sketch after this list)
  * Methodology & novelty
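
Although calibration is rated by expert reviewers, participants may want to sanity-check their reported confidence scores beforehand. The sketch below computes a standard expected calibration error (ECE) over binary correct/incorrect outcomes; it is purely illustrative and not part of the official evaluation.

```python
# Illustrative self-check of confidence calibration via expected calibration
# error (ECE); not part of the official Medico 2025 evaluation.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Toy example: per-answer confidences and whether each answer was correct.
print(expected_calibration_error([0.95, 0.80, 0.60, 0.55], [1, 1, 0, 1]))
```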

#### Quest for insight

Here are several research questions participants can strive to answer:

* Which types of explanations align best with clinical reasoning and enhance trust among medical professionals?
* How can visual attention mechanisms, uncertainty estimation, or multimodal reasoning be leveraged to provide meaningful justifications?
* How can preprocessing and post-processing techniques be optimized to improve explainability while maintaining accuracy?
* What are the most effective strategies for evaluating the quality and reliability of AI-generated explanations in GI diagnostics?

#### Participant information

More details will follow in the competition repository. Please check it regularly: [https://github.com/simula/MediaEval-Medico-2025](https://github.com/simula/MediaEval-Medico-2025)

#### References and recommended reading

*References*

\[1\] Sushant Gautam, Andrea Storås, Cise Midoglu, Steven A. Hicks, Vajira Thambawita, Pål Halvorsen, Michael A. Riegler. [Kvasir-VQA: A Text-Image Pair GI Tract Dataset](https://arxiv.org/abs/2409.01437)

\[2\] Borgli, H., Thambawita, V., Smedsrud, P.H. et al. [HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy](https://www.nature.com/articles/s41597-020-00622-y)

\[3\] Hicks, S.A., Strümke, I., Thambawita, V. et al. [On evaluation metrics for medical applications of artificial intelligence](https://www.nature.com/articles/s41598-022-09954-8)