MultiSumm: Multimodal summarization of multiple topically related websites

See the MediaEval 2026 webpage for information on how to register and participate.

Task description

The goal of the MultiSumm task is to explore the creation of multimodal summaries from multiple multimodal content items. Specifically, at MediaEval 2026, MultiSumm will explore the multimodal summarization of multiple websites. The websites for summarization will be provided by the H2020 Cultivate project. Cultivate is exploring online resources relating to Food Sharing Initiatives (FSIs) in urban and peri-urban environments for cities around the world. A key element of the Cultivate project is the creation of the ShareCity200 database. ShareCity200 will be an automatically crawled and curated database of FSIs present in 200 cities, primarily European but also including cities beyond Europe. ShareCity200 is an extension and exploration of the ShareCity100 database created as part of an earlier project.

Participants in MultiSumm will be provided with the crawled FSI web content for a small number of selected cities and asked to create a multimodal summary of the FSIs present in each city. Participants will be provided with details of the requirements for the summaries and details of the summary evaluation methods to be used.

Since the ShareCity200 database will include details of FSIs in cities in many countries, and we are seeking to automate the evaluation process as much as possible, we will be open to including specific cities at the request of individual participants to expand the linguistic scope of the task.

We define two tasks: a main task, and a subtask for additional investigation.

Main Task: Summarization of FSIs in English-Speaking Cities

Participants will be asked to build a large language model-based summarization system that produces high-quality, detailed summaries of FSIs in cities such as Dublin (Ireland) and Brighton & Hove (U.K.). The summarization output must reflect:

Subtask: Cross-Cultural and Geographically Grounded Summarization

This subtask extends the MultiSumm challenge along linguistic, cultural, and geographic dimensions, encouraging participants to explore how summarization models perform across diverse urban contexts. The subtask includes the following cities:

The output does not require map generation. Instead, participants may describe spatial patterns textually (e.g. “FSIs are concentrated in inner-city districts…”) or include a lightweight structured component indicating district-level density categories. This geographic extension is optional: it applies primarily to the subtask cities (London, Barcelona, Milan) and is not compulsory for participation or evaluation.
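As an illustration of what a lightweight structured component might look like, the following minimal Python sketch counts FSIs per district and maps each count to a density category. It is not an official format. The record fields (`name`, `district`), the category labels, the thresholds `low_max` and `medium_max`, and the example records are all assumptions made for this sketch; the task does not prescribe any of them.

```python
import json
from collections import Counter


def density_category(count, low_max=2, medium_max=5):
    """Map an FSI count to a coarse density label.

    The thresholds and labels are hypothetical; the task does not define them.
    """
    if count == 0:
        return "none"
    if count <= low_max:
        return "low"
    if count <= medium_max:
        return "medium"
    return "high"


def district_density(fsi_records, districts):
    """Count FSIs per district and attach a density label to each district.

    `fsi_records` is a list of dicts with an assumed "district" key.
    `districts` lists every district to report, including those with no FSIs.
    """
    counts = Counter(record["district"] for record in fsi_records)
    return {
        district: {
            "fsi_count": counts[district],
            "density": density_category(counts[district]),
        }
        for district in districts
    }


# Invented placeholder records, not real ShareCity200 data.
records = [
    {"name": "Example community garden", "district": "District A"},
    {"name": "Example food bank", "district": "District A"},
    {"name": "Example surplus food cafe", "district": "District A"},
    {"name": "Example seed swap", "district": "District B"},
]

summary = district_density(records, ["District A", "District B", "District C"])
print(json.dumps(summary, indent=2))
```

A structure of this kind can accompany the textual summary without any map rendering. Listing the districts explicitly means that districts with no identified FSIs still appear in the output.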

Participants who wish to apply the same district-level heatmap analysis to the main task cities (Dublin and Brighton & Hove) are also very welcome to do so, and such submissions will be considered positively in the qualitative analysis. For the subtask, the summarization requirements are the same as for the main task, but the challenge additionally includes cross-lingual understanding and translation, alignment of heterogeneous data sources, and increased complexity in the urban FSI ecosystem. Participants are encouraged to propose additional cities, especially in different linguistic contexts, to support the multilingual vision of the ShareCity200 dataset and broaden the impact of their models.

Motivation and background

Multidocument summarization of text documents has been a longstanding area of investigation, for example to provide a single summary of multiple news articles on the same story. Traditionally, this process has been complex and inflexible in terms of content style and test, requiring the use of a wide variety of natural language processing (NLP) tools and detailed specification of the summarization process. The emergence of large language model (LLM) technologies has revolutionized many NLP tasks, including summarization. The more recent arrival of multimodal LLMs is having a similar impact on topics relating to multimedia content. While the MultiSumm tasks could be tackled using traditional NLP and multimedia processing tools, the expectation is that participants will tackle them using multimodal LLM methods. To the best of our knowledge, this will be the first benchmark task focusing on this topic, and it provides a potentially valuable venue for exploring the potential and challenges of using multimodal LLMs in tasks of this sort.

Target group

The task is aimed at researchers exploring the use of multimodal LLMs, potentially drawn from both the NLP and multimedia research communities. One attractive feature of LLM methods is that they enable researchers to engage with tasks even when they are not experts in the methods and tools traditionally used to address them.

Data

Quest for insight

Here are several research questions related to this challenge that participants can strive to answer in order to go beyond just looking at the evaluation metrics:

Task organizers

Gareth J. F. Jones, Maynooth University, Ireland

Anastasia Potyagalova, DCU, Ireland

Task schedule

Registration for task participation opens: January 2026

Test data release: 1 March 2026

Runs due: 1 May 2026

Working notes papers due: 31 May 2026

MediaEval 2026 Workshop, Sat.-Sun. 15-16 June 2026, Amsterdam, Netherlands and Online, co-located with ACM ICMR 2026