#### Task description
Gastrointestinal (GI) diseases are among the most common and critical health concerns worldwide, with conditions like colorectal cancer (CRC) requiring early diagnosis and intervention. AI-driven decision support systems have shown potential in assisting clinicians with diagnosis, but a major challenge remains: explainability. While deep learning models can achieve high diagnostic accuracy, their "black-box" nature limits their adoption in clinical practice, where trust and interpretability are essential. After successfully organizing multiple Medico challenges at MediaEval in previous years, we propose a new task for 2025: Medico: Visual Question Answering (VQA) for Gastrointestinal Imaging.
Medical Visual Question Answering (VQA) is a rapidly growing research area that combines computer vision and natural language processing to answer clinically relevant questions based on medical images. However, existing VQA models often lack transparency, making it difficult for healthcare professionals to assess the reliability of AI-generated answers. To address this, the Medico 2025 challenge will focus on explainable VQA for GI imaging, encouraging participants to develop models that provide not only accurate answers but also clear justifications aligned with clinical reasoning.
This challenge will offer a benchmark dataset containing GI images, videos, and associated VQA annotations, allowing for rigorous evaluation of AI models. By integrating multimodal data and explainability metrics, we aim to advance research in interpretable AI and improve the potential for clinical adoption.
We define two main subtasks for this year's challenge. Subtask 2 builds on Subtask 1, meaning Subtask 1 must be completed in order to participate in Subtask 2.

* **Subtask 1: AI Performance on Medical Image Question Answering** – This subtask challenges participants to develop AI models that can accurately interpret and respond to clinical questions based on GI images from the Kvasir-VQA dataset, which includes 6,500 annotated images spanning various conditions and medical instruments. Questions fall into six categories: Yes/No, Single-Choice, Multiple-Choice, Color-Related, Location-Related, and Numerical Count, requiring models to process both visual and textual information. Performance will be assessed based on several quantitative metrics.

* **Subtask 2: Clinician-Oriented Multimodal Explanations in GI** – This subtask extends Subtask 1 by focusing on the interpretability of model outputs for clinical decision-making. Models must not only generate accurate answers but also provide clear, multimodal explanations that enhance clinician trust and usability. Multimodality is required, meaning that explanations must integrate multiple forms of reasoning that work together to justify predictions. For example, models could highlight relevant image regions while providing textual reasoning grounded in medical knowledge and confidence scores, as illustrated in the sketch after this list. The goal is to align AI-driven insights with clinical reasoning, ensuring that justifications are interpretable, complementary, and useful in practice. Performance will be assessed based on explanation clarity and medical relevance, with expert reviewers evaluating how well the combined modalities support clinical decision-making.
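
As a concrete illustration of one possible explanation modality, the following is a minimal sketch that computes a Grad-CAM-style saliency map, highlighting the image regions that most influenced a predicted answer; such a map could then be paired with textual reasoning and a confidence score to satisfy the multimodality requirement. The ResNet-50 backbone, the choice of `layer4`, and the classification-style answer head are illustrative assumptions, not part of the task definition.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative stand-in: any VQA model with a convolutional visual
# encoder could be explained the same way.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4  # last convolutional block

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer.register_forward_hook(save_activation)
target_layer.register_full_backward_hook(save_gradient)

def grad_cam(image, class_idx):
    """Return an [H, W] saliency map in [0, 1] for one answer class."""
    logits = model(image)            # image: [1, 3, H, W], normalized
    model.zero_grad()
    logits[0, class_idx].backward()  # gradient of the chosen answer logit
    # Weight each feature map by its spatially averaged gradient.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]

x = torch.randn(1, 3, 224, 224)         # stand-in for an endoscopic frame
answer = model(x).argmax(dim=1).item()  # stand-in for the predicted answer
saliency = grad_cam(x, answer)          # heatmap to overlay for clinicians
```

Gradient-based maps are only one option; attention visualization or perturbation-based methods could play the same role in a multimodal explanation.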
#### Motivation and background
Medical AI systems must be both accurate and interpretable to be useful in clinical practice. While deep learning models have shown great potential in diagnosing gastrointestinal (GI) conditions from medical images, their adoption remains limited due to a lack of transparency. Clinicians need to understand why an AI system makes a specific decision, especially when it comes to critical medical diagnoses. Explainable AI (XAI) methods aim to bridge this gap by providing justifications that align with clinical reasoning, improving trust, reliability, and ultimately patient outcomes.
This challenge builds upon previous work in medical VQA, where AI models answer clinically relevant questions based on GI images. However, traditional VQA models often provide answers without explanations, making it difficult for medical professionals to assess their validity. By incorporating explainability into the task, we encourage the development of models that not only provide accurate responses but also offer meaningful insights into their decision-making process. This will help ensure that AI systems can be safely integrated into clinical workflows, assisting rather than replacing human expertise.
#### Target group
We actively invite people from multiple communities to submit solutions to the proposed task. We strongly believe that a significant fraction of multimedia researchers can contribute to this medical scenario, and we hope that many will take a personal interest in working on the task and trying out their ideas. To help young researchers succeed, we will also provide mentoring for students who want to tackle the task (undergraduate and graduate students are very welcome).
#### Data
The dataset for Medico 2025, Kvasir-VQA, is a text-image dataset for the gastrointestinal (GI) tract built upon the HyperKvasir and Kvasir-Instrument datasets, now enhanced with question-and-answer annotations. It is specifically designed to support Visual Question Answering (VQA) tasks and other multimodal AI applications in GI diagnostics. The dataset includes 6,500 annotated GI images, spanning a range of conditions and medical instruments used in procedures.
Annotations in Kvasir-VQA were developed with input from medical professionals and include six key types of questions:
* Yes/No Questions
* Single-Choice Questions
* Multiple-Choice Questions
* Color-Related Questions
* Location-Related Questions
* Numerical Count Questions
Each question is designed to test AI models on different aspects of clinical decision-making, such as recognizing abnormalities, identifying anatomical landmarks, or interpreting findings based on image features.
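
For quick orientation, below is a minimal loading sketch using the Hugging Face `datasets` library. The hub identifier and field names are assumptions based on the public Kvasir-VQA release; defer to the official task materials for the authoritative download instructions.

```python
from datasets import load_dataset

# Assumed hub identifier and field names -- verify against the
# official challenge instructions before relying on them.
ds = load_dataset("SimulaMet-HOST/Kvasir-VQA")
split = next(iter(ds.values()))   # first available split
sample = split[0]                 # one image-question-answer record
print(sample["question"], "->", sample["answer"])
```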
#### Ground truth
#### Evaluation methodology
**Subtask 1: Accuracy and Explainability in Answering GI Questions**
The evaluation for this subtask will assess not only the correctness of the model’s answers but also their interpretability. Key quantitative metrics include (a minimal scoring sketch follows the list):
* Accuracy: The proportion of correct responses.
* Precision: The proportion of true positive answers among all positive predictions.
* Recall: The proportion of true positive answers relative to actual positive cases.
* F1 Score: The harmonic mean of precision and recall.
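
To make the scoring concrete, here is a minimal sketch of how these metrics could be computed with scikit-learn. The answers are hypothetical, and the macro averaging is an assumption (the organizers' exact protocol is not specified here); multiple-choice questions would additionally need a multi-label encoding.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical answers for five single-choice questions.
y_true = ["polyp", "no", "cecum", "no", "polyp"]   # ground-truth answers
y_pred = ["polyp", "no", "ileum", "yes", "polyp"]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  f1={f1:.2f}")
```

Macro averaging weights rare classes equally with common ones, which is often desirable for medical findings, but participants should confirm the official averaging scheme when it is announced.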
**Subtask 2:** The evaluation for this subtask will consider both answer correctness and explanation quality. Key metrics include:
* Subtask 1 Metrics: The quantitative metrics from Subtask 1 (accuracy, precision, recall, and F1 score).
* Explainability Score: A metric assessing the clarity, coherence, and medical relevance of explanations, evaluated by medical experts.
#### Quest for insight
Here are several research questions related to this challenge that participants can strive to answer in order to go beyond just looking at the evaluation metrics:
* How can AI models generate explanations that align with clinical reasoning and enhance trust among medical professionals?
* What techniques improve the transparency of deep learning models when answering medical VQA tasks?
* How can visual attention mechanisms, uncertainty estimation, or multimodal reasoning be leveraged to provide meaningful justifications?
* What are the most effective strategies for evaluating the quality and reliability of AI-generated explanations in GI diagnostics?
* How can preprocessing and post-processing techniques be optimized to improve explainability while maintaining accuracy?
#### Risk management
The task has been held at MediaEval for several years and has previously been very successful. We will prepare documents and materials to help participants get started, such as starter code and example submissions. We will also reach out to participants from previous years.
#### Participant information
More details will follow.
#### References and recommended reading
More details will follow.
#### Task organizers
* Sushant Gautam, SimulaMet, Norway
* Vajira Thambawita, SimulaMet, Norway
* Pål Halvorsen, SimulaMet, Norway
* Michael A. Riegler, SimulaMet, Norway
* Steven A. Hicks, SimulaMet, Norway
#### Task schedule
The schedule will be updated with the exact dates.
* May 2025: Development Data release
* June 2025: Test data release
* September 2025: Runs due and results returned. Exact dates to be announced.
* 08 October 2025: Working notes paper
* 25-26 October 2025: MediaEval Workshop, Dublin, Ireland and Online.