Musti: Multimodal Understanding of Smells in Texts and Images

See the MediaEval 2023 webpage for information on how to register and participate.

Task description

Smell is underrepresented in research on multimedia analysis and multimedia representation. The goal of the Musti task is to address this gap and to further the understanding of descriptions and depictions of smells in texts and images. Participants are provided with multilingual texts (English, German, Italian, French, Slovenian) and images from the 17th to the 20th century that pertain to smell (i.e., that were selected because they evoke smells). Participants must create automatic solutions that recognise references to smells in texts and images and connect these smell references across the different modalities.

Subtask 1: Musti classification (mandatory): Task participants develop language and image recognition technologies to predict whether a text passage and an image evoke the same smell source or not. This subtask can be cast as a binary classification problem.

Subtask 2: Musti detection (optional): Task participants are asked to identify the common smell source(s) between the text passages and the images. Detecting the smell source includes detecting the person, object, or place that has a specific smell or that produces an odour (e.g., plant, animal, perfume, human). In other words, the smell source is the entity or phenomenon that a perceiver experiences with his or her senses. This subtask can be cast as a multi-label classification problem.

Subtask 3: Musti zero-shot (optional): This evaluation contains a language that is not in the training data. Participants predict whether an image and a text passage evoke the same smell source or not (as in Subtask 1) and identify the common smell source(s) between the text passages and the images (as in Subtask 2). The training data is in English, French, German, and Italian. The test data will be in Slovene and will cover both the classification and the detection task.

Motivation and background

Smell offers a powerful and direct entry to our emotions and memories. To make sense of digital (heritage) collections, it is necessary to go beyond visual, oculo-centric perspectives and to engage with their olfactory dimension. Via the Musti task, we aim to accelerate the understanding of olfactory references in multilingual text and images as well as the connection between these modalities. As exhibitions at Mauritshuis in The Hague, Netherlands, Museum Ulm in Ulm, Germany, and the Prado Museum in Madrid, Spain demonstrate, museums and galleries are keen to enrich museum visits with olfactory components—either for a more immersive experience or to create a more inclusive experience for differently abled museum visitors such as those with a visual impairment.

Reinterpreting historical scents is attracting attention from various research disciplines (Huber et al., 2022), in some cases leading to collaborations with perfume makers. One example is the Scent of the Golden Age candle, which historians and a perfume maker developed together from a recipe by Constantijn Huygens.

To ensure that such enrichments are grounded in historically correct contexts, language and computer vision technologies can help museums and galleries find olfactorily relevant examples in their collections and related sources.

Target group

The task is of interest to researchers working in natural language processing, computer vision, multimedia analysis, and cultural heritage.


Data

The MUSTI 2023 dataset consists of copyright-free texts and partly copyrighted images that can be downloaded and submitted by the participants using the URLs we provide. We offer texts in English, Dutch, French, German, Italian, and Slovene that participants are to match to the images.

The texts are selected from open repositories such as Project Gutenberg, Europeana, Royal Society Corpus, Deutsches Textarchiv, Gallica, Wikisource, and Liber Liber. The images are selected from different archives such as RKD, Bildindex der Kunst und Architektur, Museum Boijmans, Ashmolean Museum Oxford, and Plateforme ouverte du patrimoine.

The images are annotated with 169 categories of smell objects and gestures such as flowers, food, animals, sniffing and holding the nose. The object categories are organised in a two-level taxonomy.

The Odeuropa text and image benchmark datasets are available as additional training data to the participants. The image dataset consists of 4,696 images with 36,663 associated object annotations, approximately 600 gesture annotations, and image-level metadata. We will also provide the output of a text processing system we have developed to identify text snippets that contain smell references.

The participants will be evaluated on a held-out dataset of roughly 1,200 images with associated texts in the four languages.

Ground truth

The ground truth consists of images and text snippets that contain appearances or mentions of smell-related objects. If a text passage and an image evoke the same smell, the relation between the image and the text passage is manually labelled as positive, otherwise as negative. This dataset is distilled from the Odeuropa text and image benchmark datasets.
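As an illustration of the labelling described above, the following minimal Python sketch shows one way a participant might represent a ground-truth pair in memory. The class name, field names, example URL, and smell-source labels are all hypothetical and do not reflect the format of the released data. The sketch also assumes that a pair is positive exactly when the image and the text share at least one smell source, which simplifies the manual annotation described here.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ImageTextPair:
    """Hypothetical in-memory record for one image-text pair."""
    image_url: str                   # where the image can be downloaded
    text: str                        # the text passage
    language: str                    # language code of the passage, e.g. "en"
    image_sources: FrozenSet[str]    # smell sources annotated in the image
    text_sources: FrozenSet[str]     # smell sources mentioned in the text

    @property
    def shared_sources(self) -> FrozenSet[str]:
        # Smell sources evoked by both modalities (the Subtask 2 target).
        return self.image_sources & self.text_sources

    @property
    def label(self) -> int:
        # Assumption: positive (1) iff at least one smell source is shared.
        return 1 if self.shared_sources else 0


# Hypothetical example pair; the URL and labels are placeholders.
pair = ImageTextPair(
    image_url="https://example.org/still-life.jpg",
    text="The roses filled the room with their sweet perfume.",
    language="en",
    image_sources=frozenset({"flower", "fruit"}),
    text_sources=frozenset({"flower", "perfume"}),
)
print(pair.label, sorted(pair.shared_sources))
```

Under these assumptions, the `label` property corresponds to the Subtask 1 target and `shared_sources` to the Subtask 2 target.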

Evaluation methodology

Task runs will be evaluated against a gold standard consisting of image-text pairs. We will evaluate using multiple statistics as each provides a slightly different perspective on the results. We will provide the code and models of the baselines we developed for MUSTI 2022. Specifically, each subtask will be evaluated using the following metrics:

Subtask 1: Musti classification: Predicting whether an image and a text passage evoke the same smell source or not. This subtask will be evaluated using precision, recall, and F1-measure. As multiple text passages in different languages can be linked to the same image, we will employ multiple linking scorers such as CEAF and BLANC to measure the performance across different smell reference chains.
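To make the Subtask 1 metrics concrete, here is a minimal Python sketch of precision, recall, and F1 for the positive class of a binary labelling. It is not the official scorer: the function name and the toy labels are assumptions made for illustration, and the CEAF and BLANC linking scorers are not covered.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 1 = same smell source, 0 = different smell source.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In the toy example there are two true positives, one false positive, and one false negative, so all three scores equal 2/3.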

Subtask 2: Musti detection: Identifying the common smell source(s) between the text passages and the images. For this subtask, precision, recall, and F1-measure will be employed, as well as more fine-grained evaluation methods such as RUFES, which can accommodate multi-level taxonomies.
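For the multi-label setting of Subtask 2, one common way to compute precision, recall, and F1 is micro-averaging over the predicted and gold label sets of all pairs. The sketch below assumes this micro-averaged, flat (taxonomy-unaware) variant; the task description does not specify the averaging scheme, and the sketch does not reproduce RUFES. The function name and the smell-source labels are hypothetical.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-pair label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Pool the counts over all image-text pairs before computing the scores.
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one set of shared smell sources per image-text pair.
gold_sets = [{"flower", "fruit"}, {"fish"}, set()]
pred_sets = [{"flower"}, {"fish", "dog"}, set()]
micro_p, micro_r, micro_f1 = micro_prf(gold_sets, pred_sets)
print(f"P={micro_p:.3f} R={micro_r:.3f} F1={micro_f1:.3f}")
```

Because this variant treats every label as independent, confusing two closely related categories costs as much as confusing two unrelated ones, which is the limitation that taxonomy-aware methods are meant to address.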

Subtask 3: Musti zero-shot: The evaluation for this subtask will be the same as for Subtasks 1 and 2. The only difference is that there will be no training data in the test language for this subtask.

Quest for insight

Here are several research questions related to this challenge that participants can strive to answer in order to go beyond just looking at the evaluation metrics:

Participant information

A. Hürriyetoğlu, T. Paccosi, S. Menini, M. Zinnen, P. Lisena, K. Akdemir, … & M. van Erp, “MUSTI-Multimodal Understanding of Smells in Texts and Images at MediaEval 2022” In Proceedings of MediaEval 2022 CEUR Workshop, 2022, URL:

K. Akdemir, A. Hürriyetoğlu, R. Troncy, T. Paccosi, S. Menini, M. Zinnen, & V. Christlein, “Multimodal and Multilingual Understanding of Smells using VilBERT and mUNITER” In Proceedings of MediaEval 2022 CEUR Workshop, 2022, URL:

Y. Shao, Y. Zhang, W. Wan, J. Li, & J. Sun, “Multilingual Text-Image Olfactory Object Matching Based on Object Detection” In Proceedings of MediaEval 2022 CEUR Workshop, 2022, URL:

B. Huber, T. Larsen, R. Spengler, and N. Boivin, “How to use modern science to reconstruct ancient scents” Nat Hum Behav (2022).

S. Ehrich, C. Verbeek, M. Zinnen, L. Marx, C. Bembibre, and I. Leemans, “Nose-First. Towards an Olfactory Gaze for Digital Art History.” In 2021 Workshops and Tutorials-Language Data and Knowledge, LDK 2021 (pp. 1-17). September 2021, Zaragoza, Spain.

P. Lisena, D. Schwabe, M. van Erp, R. Troncy, W. Tullett, I. Leemans, L. Marx, and S. Ehrich, “Capturing the semantics of smell: The Odeuropa data model for olfactory heritage information” in Proceedings of ESWC 2022, Extended Semantic Web Conference, May 29-June 2, 2022, Hersonissos, Greece.

S. Menini, T. Paccosi, S. Tonelli, M. van Erp, I. Leemans, P. Lisena, R. Troncy, W. Tullett, A. Hürriyetoğlu, G. Dijkstra, F. Gordijn, E. Jürgens, J. Koopman, A. Ouwerkerk, S. Steen, I. Novalija, J. Brank, D. Mladenic, and A. Zidar, “A Multilingual Benchmark to Capture Olfactory Situations over Time” In Proceedings of LChange 2022. May 2022. Dublin, Ireland.

S. Menini, T. Paccosi, S. Tekiroğlu, and S. Tonelli, “Building a Multilingual Taxonomy of Olfactory Terms with Timestamps” In Proceedings of Language Resources and Evaluation Conference (LREC) 2022. June 2022. Marseille, France.

S. Tonelli and S. Menini, “FrameNet-like annotation of olfactory information in texts” in Proceedings of the 5th joint SIGHUM workshop on computational linguistics for cultural heritage, social sciences, humanities and literature, Punta Cana, Dominican Republic (online), 2021, p. 11–20.

M. Zinnen and V. Christlein, “Annotated Image Data version 1 - Odeuropa Deliverable D2.2”

M. Zinnen et al., “Odor: The ICPR2022 Odeuropa challenge on olfactory object recognition” In 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.

Task organizers

Task facilitators

Task schedule


This task is an output of the Odeuropa project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004469.