See the MediaEval 2025 webpage for information on how to register and participate.
See our GitHub Repository for the latest information about the task. We encourage participants to check the repository regularly for updates.
Gastrointestinal (GI) diseases are among the most common and critical health concerns worldwide, with conditions like colorectal cancer (CRC) requiring early diagnosis and intervention. AI-driven decision support systems have shown potential in assisting clinicians with diagnosis, but a major challenge remains: explainability. While deep learning models can achieve high diagnostic accuracy, their “black-box” nature limits their adoption in clinical practice, where trust and interpretability are essential. After successfully organizing multiple Medico challenges at MediaEval in previous years, we propose a new task for Medico 2025: Visual Question Answering (with multimodal explanations) for Gastrointestinal Imaging.
Medical Visual Question Answering (VQA) is a rapidly growing research area that combines computer vision and natural language processing to answer clinically relevant questions based on medical images. However, existing VQA models often lack transparency, making it difficult for healthcare professionals to assess the reliability of AI-generated answers. To address this, the Medico 2025 challenge will focus on explainable VQA for GI imaging, encouraging participants to develop models that provide not only accurate answers but also clear justifications aligned with clinical reasoning.
This challenge will offer a benchmark dataset containing GI images, videos, and associated VQA annotations, allowing for rigorous evaluation of AI models. By integrating multimodal data and explainability metrics, we aim to advance research in interpretable AI and improve the potential for clinical adoption.
We define two main subtasks for this year’s challenge. Subtask 2 builds on Subtask 1, meaning Subtask 1 must be completed in order to participate in Subtask 2.
Subtask 1: AI Performance on Medical Image Question Answering
This subtask challenges participants to develop AI models that can accurately interpret and respond to clinical questions based on GI images from the Kvasir-VQA-x1 dataset, which contains 159,549 question–answer pairs from 6,500 original GI images, with additional weakly augmented images and complexity-level annotations. Questions fall into six main categories: Yes/No, Single-Choice, Multiple-Choice, Color-Related, Location-Related, and Numerical Count, as well as merged reasoning-based questions. Performance will be assessed using metrics such as BLEU, ROUGE (1/2/L), and METEOR, alongside medical correctness and relevance.
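For orientation, the snippet below is a minimal sketch of how these text metrics can be computed with the Hugging Face `evaluate` library; the official scoring script and the medical-correctness assessment may differ and will be published in the task repository.

```python
# Minimal sketch of the Subtask 1 text metrics (BLEU, ROUGE-1/2/L, METEOR)
# using the Hugging Face `evaluate` library; the official scoring script may differ.
import evaluate

predictions = ["there is one polyp in the lower left of the image"]
references = ["one polyp is visible in the lower left of the image"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))   # rouge1/2/L
print(meteor.compute(predictions=predictions, references=references))
```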
Subtask 2: Clinician-Oriented Multimodal Explanations in GI
This subtask extends Subtask 1 by focusing on the interpretability of model outputs for clinical decision-making. Models must not only generate accurate answers but also provide clear, multimodal explanations that enhance clinician trust and usability. Explanations must be faithful to the model’s reasoning, clinically relevant, and useful for real-world decision-making. Participants are encouraged to combine textual clinical reasoning with visual localization (e.g., heatmaps, segmentation masks, bounding boxes) and/or confidence measures. Performance will be assessed based on answer correctness, explanation clarity, visual alignment, confidence calibration, and medical relevance, as rated by expert reviewers.
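As one illustration of the visual-localization component, the sketch below computes a Grad-CAM heatmap using a generic torchvision classifier as a stand-in for a participant's own vision backbone; the model, target layer, and input are placeholders and not part of the task specification.

```python
# Minimal Grad-CAM sketch for heatmap-style visual explanations; the backbone,
# target layer, and input below are placeholders, not the task's reference setup.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(feat=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(grad=go[0]))

image = torch.randn(1, 3, 224, 224)            # replace with a preprocessed GI image
logits = model(image)
logits[0, logits.argmax()].backward()          # gradient w.r.t. the predicted class

weights = gradients["grad"].mean(dim=(2, 3), keepdim=True)   # per-channel importance
cam = F.relu((weights * activations["feat"]).sum(dim=1))     # weighted activation map
cam = F.interpolate(cam[None], size=image.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
# `cam` can now be overlaid on the input image and submitted alongside the
# textual rationale and a confidence estimate.
```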
Medical AI systems must be both accurate and interpretable to be useful in clinical practice. While deep learning models have shown great potential in diagnosing gastrointestinal (GI) conditions from medical images, their adoption remains limited due to a lack of transparency. Clinicians need to understand why an AI system makes a specific decision, especially when it comes to critical medical diagnoses. Explainable AI (XAI) methods aim to bridge this gap by providing justifications that align with clinical reasoning, improving trust, reliability, and ultimately patient outcomes.
This challenge builds upon previous work in medical VQA, where AI models answer clinically relevant questions based on GI images. However, traditional VQA models often provide answers without explanations, making it difficult for medical professionals to assess their validity. By incorporating explainability into the task, we encourage the development of models that not only provide accurate responses but also offer meaningful insights into their decision-making process. This will help ensure that AI systems can be safely integrated into clinical workflows, assisting rather than replacing human expertise.
We invite participation from multiple communities, including computer vision, natural language processing, multimedia analysis, medical imaging, and human–AI interaction. We strongly believe that many multimedia researchers can contribute to this medical scenario, and we hope that many people will be personally motivated to take on the challenge and try out their ideas. To ensure that young researchers succeed, we will also provide mentoring for students at both undergraduate and graduate levels.
The dataset for Medico 2025, Kvasir-VQA-x1 [1, 2], is a large-scale gastrointestinal (GI) image–text dataset built upon the HyperKvasir and Kvasir-Instrument datasets, enhanced with 159,549 naturalized question–answer annotations, complexity-level scores for curriculum training, and weak augmentations (10 per original image). It is specifically designed to support Visual Question Answering (VQA) and other multimodal AI applications in GI diagnostics.
Dataset: https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
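A quick-start sketch for loading the dataset with the Hugging Face `datasets` library is shown below; the split and column names are not assumed here, so consult the dataset card at the URL above for the actual schema.

```python
# Quick-start sketch for loading Kvasir-VQA-x1 with the Hugging Face `datasets`
# library; check the dataset card for the actual splits and column names.
from datasets import load_dataset

ds = load_dataset("SimulaMet/Kvasir-VQA-x1")   # dataset ID taken from the URL above
print(ds)                                      # lists the available splits and columns

sample = next(iter(ds[list(ds.keys())[0]]))    # first record of the first split
print({k: type(v).__name__ for k, v in sample.items()})
```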
Subtask 1: VQA Performance
Subtask 2: Explainability
Here are several research questions participants can strive to answer:
More details will follow on the competition repository. Please check it regularly: https://github.com/simula/MediaEval-Medico-2025
References
Recommended
The program will be updated with the exact dates.