Medico: VQA (with multimodal explanations) for gastrointestinal imaging

See the MediaEval 2025 webpage for information on how to register and participate.

Task description

Gastrointestinal (GI) diseases are among the most common and critical health concerns worldwide, with conditions like colorectal cancer (CRC) requiring early diagnosis and intervention. AI-driven decision support systems have shown potential in assisting clinicians with diagnosis, but a major challenge remains: explainability. While deep learning models can achieve high diagnostic accuracy, their “black-box” nature limits their adoption in clinical practice, where trust and interpretability are essential. After successfully organizing multiple Medico challenges at MediaEval in previous years, we propose a new task for 2025: Medico: Visual Question Answering (VQA) for Gastrointestinal Imaging.

Medical Visual Question Answering (VQA) is a rapidly growing research area that combines computer vision and natural language processing to answer clinically relevant questions based on medical images. However, existing VQA models often lack transparency, making it difficult for healthcare professionals to assess the reliability of AI-generated answers. To address this, the Medico 2025 challenge will focus on explainable VQA for GI imaging, encouraging participants to develop models that provide not only accurate answers but also clear justifications aligned with clinical reasoning.
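To make the setup concrete, the following is a minimal sketch of an image-plus-question baseline using a generic, non-medical vision-language model from the Hugging Face `transformers` library. The model choice and the image path are illustrative assumptions, not the task's reference method.

```python
# Minimal VQA sketch: a generic vision-language model answers a free-text
# question about a single image. The model ("Salesforce/blip-vqa-base") and
# the image path are illustrative assumptions, not an official baseline.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
image = Image.open("endoscopy_frame.jpg")  # placeholder path to a GI image
question = "Are there any visible abnormalities in this image?"
print(vqa(image=image, question=question, top_k=1))
# -> e.g. [{'answer': '...', 'score': 0.9}] -- an answer without any explanation,
#    which is exactly the gap the Medico 2025 task asks participants to close.
```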

This challenge will offer a benchmark dataset containing GI images, videos, and associated VQA annotations, allowing for rigorous evaluation of AI models. By integrating multimodal data and explainability metrics, we aim to advance research in interpretable AI and improve the potential for clinical adoption.

We define two main subtasks for this year’s challenge. Subtask 2 builds on Subtask 1, meaning Subtask 1 must be completed in order to participate in Subtask 2.

Motivation and background

Medical AI systems must be both accurate and interpretable to be useful in clinical practice. While deep learning models have shown great potential in diagnosing gastrointestinal (GI) conditions from medical images, their adoption remains limited due to a lack of transparency. Clinicians need to understand why an AI system makes a specific decision, especially when it comes to critical medical diagnoses. Explainable AI (XAI) methods aim to bridge this gap by providing justifications that align with clinical reasoning, improving trust, reliability, and ultimately patient outcomes.

This challenge builds upon previous work in medical VQA, where AI models answer clinically relevant questions based on GI images. However, traditional VQA models often provide answers without explanations, making it difficult for medical professionals to assess their validity. By incorporating explainability into the task, we encourage the development of models that not only provide accurate responses but also offer meaningful insights into their decision-making process. This will help ensure that AI systems can be safely integrated into clinical workflows, assisting rather than replacing human expertise.

Target group

We actively invite people from multiple communities to submit solutions to the proposed task. We strongly believe that a significant fraction of multimedia researchers can contribute to this medical scenario, and we hope that many will take a personal interest in the task and try out their ideas. To help young researchers succeed, we will also provide mentoring for students who want to tackle the task (undergraduate and graduate students are very welcome).

Data

The dataset for Medico 2025, Kvasir-VQA [1, 2], is an image-text dataset of the gastrointestinal (GI) tract built upon the HyperKvasir and Kvasir-Instrument datasets, now enhanced with question-and-answer annotations. It is specifically designed to support Visual Question Answering (VQA) and other multimodal AI applications in GI diagnostics. The dataset includes 6,500 annotated GI images, spanning a range of conditions and medical instruments used in procedures.
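For participants who want to explore the data programmatically, the snippet below sketches one way to load and inspect it, assuming the dataset is published on the Hugging Face Hub. The repository identifier, split name, and field names are assumptions and should be checked against the official dataset documentation in [1, 2].

```python
# Sketch of loading and inspecting Kvasir-VQA with the `datasets` library.
# The repository id "SimulaMet-HOST/Kvasir-VQA" and the split name are
# assumptions; consult the official dataset page for the exact values.
from datasets import load_dataset

ds = load_dataset("SimulaMet-HOST/Kvasir-VQA", split="raw")  # assumed id/split
print(len(ds), "image-question-answer records")
print(ds.column_names)          # actual field names may differ
example = ds[0]
print({k: v for k, v in example.items() if k != "image"})  # text fields only
```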

Annotations in Kvasir-VQA were developed with input from medical professionals and cover six key types of questions.

Each question is designed to test AI models on different aspects of clinical decision-making, such as recognizing abnormalities, identifying anatomical landmarks, or interpreting findings based on image features.

Evaluation methodology

Subtask 1: Accuracy and Explainability in Answering GI Questions

The evaluation for this subtask will assess not only the correctness of the model’s answers but also their interpretability. Key metrics include:

Subtask 2

The evaluation for this subtask will consider both answer correctness and explanation quality. Key metrics include:
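The concrete metric lists for the two subtasks are not reproduced in this excerpt. Purely as an illustration of how answer correctness and explanation quality can be scored, the sketch below computes exact-match accuracy over short answers and ROUGE-L over free-text explanations using the `evaluate` library; the metric choices, helper function, and toy data are assumptions, not the official evaluation script.

```python
# Illustrative scoring sketch (not the official evaluation script): exact-match
# accuracy for short answers, ROUGE-L for free-text explanations.
import evaluate

def exact_match_accuracy(predictions, references):
    """Fraction of predicted answers that match the reference after lowercasing."""
    normalize = lambda s: s.strip().lower()
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Toy, hypothetical predictions and ground truth for demonstration only.
pred_answers = ["polyp", "no"]
gold_answers = ["polyp", "yes"]
print("answer accuracy:", exact_match_accuracy(pred_answers, gold_answers))

rouge = evaluate.load("rouge")
pred_expl = ["A polyp is visible in the lower-left part of the image."]
gold_expl = ["There is a polyp in the lower-left region of the image."]
print("explanation ROUGE-L:",
      rouge.compute(predictions=pred_expl, references=gold_expl)["rougeL"])
```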

Quest for insight

Here are several research questions related to this challenge that participants can strive to answer in order to go beyond just looking at the evaluation metrics:

Participant information

More details will follow.

References

Recommended

Task organizers

Task schedule

The schedule will be updated with the exact dates.