No-Audio Multimodal Speech Detection Task

See the MediaEval 2020 webpage for information on how to register and participate.

Task Description

Task participants are provided with overhead-camera video of individuals taking part in a conversation. Each individual also wears a badge-like device that records tri-axial acceleration.

The goal of the task is to automatically estimate, from these alternative modalities, when the person seen in the video starts speaking and when they stop. In contrast to conventional speech detection, no audio is used for this task. Instead, the automatic estimation system must exploit the natural human movements that accompany speech (i.e., speaker gestures, as well as shifts in pose and proximity).

This task consists of two subtasks (unimodal and multimodal estimation), together with a new optional subtask.

Speaking predictions must be made for every second. Teams may, however, use a different interval length internally and later interpolate or extrapolate their predictions to the one-second level.
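As a minimal sketch of the required post-processing, the helper below (the function name and averaging strategy are illustrative assumptions, not part of the task specification) collapses frame-level speaking scores produced at the sensor rate into one score per second by averaging the frames within each second:

```python
def scores_to_seconds(frame_scores, fps=20):
    """Average frame-level speaking scores into one score per second.

    frame_scores: a flat list of per-frame scores at `fps` Hz.
    Any trailing partial second is dropped.
    """
    n_seconds = len(frame_scores) // fps
    return [
        sum(frame_scores[s * fps:(s + 1) * fps]) / fps
        for s in range(n_seconds)
    ]

# Example: three seconds of synthetic scores at 20 Hz.
per_second = scores_to_seconds([0.9] * 20 + [0.1] * 20 + [0.5] * 20)
print(per_second)  # three per-second scores, one per full second
```

Averaging (rather than, say, majority voting on thresholded frames) keeps the output non-binary, which matters for the ROC-AUC evaluation described below.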

Motivation and Background

An important but under-explored problem is the automated analysis of conversational dynamics in large unstructured social gatherings such as networking or mingling events. Research has shown that attending such events contributes greatly to career and personal success [7]. While much progress has been made in the analysis of small pre-arranged conversations, scaling up robustly presents a number of fundamentally different challenges.

This task focuses on analysing one of the most basic elements of social behaviour: the detection of speaking turns. Research has shown the benefit of deriving features from speaking turns for estimating many different social constructs, such as dominance or cohesion, to name but a few. Unlike traditional tasks that have used audio for this, here the idea is to leverage the body movements (i.e. gestures) that are performed during speech production, captured by video and/or wearable acceleration and proximity sensors. The benefit is a more privacy-preserving method of extracting socially relevant information, with the potential to scale to settings where recording audio may be impractical.

The relationship between speaking and accompanying body behaviour, such as gesturing, has been well-documented by social scientists [1]. Some efforts have been made in recent years to estimate these behaviours from a single body-worn triaxial accelerometer hung around the neck [2,3]. This form of sensing could be embedded into a smart ID badge for use in settings such as conferences, networking events, or organizations. In other work, video has been used to estimate speaking status [4,5]. Despite these efforts, a major challenge has been achieving estimation performance competitive with audio-based systems. As yet, the multi-modal aspects of the problem remain under-explored, and they are the main focus of this challenge.

Target Group

This challenge is targeted at researchers in wearable devices, computer vision, and signal and speech processing. The aim is to provide an entry-level task with a clearly definable ground truth. The problem can be approached without domain knowledge, but many of its nuances become easier to handle once the behaviour behind them is understood. The hope is that this task will allow researchers who are not familiar with social signal processing to learn more about the problem domain; we have subsequent challenges in mind for later years that would become increasingly complex in terms of the social context and of social constructs that are less easily understood in terms of their social cue representation (e.g. personality, attraction, conversational involvement). The recommended readings for the challenge are [3,5,6]; references [1,2,4] may provide additional insight into how to solve the problem.


Data

The data consists of 70 people who attended one of three separate mingle events (cocktail parties). Overhead camera data as well as wearable tri-axial accelerometer data are available for a 30-minute interval of each event. Each person wore the recording device hung around the neck like a conference badge. A subset of this data is kept as a test set; all samples in the test set come from subjects who do not appear in the training set.

All the data is synchronized. The video data is mostly complete, with some segments missing because participants could leave the recording area at any time (e.g. to go to the bathroom). Both the video frame rate and the accelerometer sample rate are 20 Hz. Note that due to the crowded nature of the events, there can be strong occlusions between participants in the video, which we hope to evaluate in one of our subtasks.
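To make the data layout concrete, here is a minimal sketch of turning the 20 Hz tri-axial accelerometer stream into simple per-second feature vectors (mean and standard deviation per axis). The function name and feature choice are illustrative assumptions; they are a common baseline for movement-based speech detection, not the prescribed approach:

```python
import statistics

def window_features(accel, fps=20):
    """Compute per-second features from a tri-axial accelerometer stream.

    accel: list of (x, y, z) samples at `fps` Hz.
    Returns one feature vector per full second: the mean and
    population standard deviation of each axis, capturing coarse
    body movement that may accompany speech.
    """
    feats = []
    for s in range(len(accel) // fps):
        window = accel[s * fps:(s + 1) * fps]
        vec = []
        for axis in range(3):
            vals = [sample[axis] for sample in window]
            vec.append(statistics.fmean(vals))
            vec.append(statistics.pstdev(vals))
        feats.append(vec)
    return feats
```

One feature vector per second also lines up naturally with the per-second prediction requirement of the task.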

Evaluation Methodology

Manual annotations of binary speaking status (speaking / non-speaking) are provided for all people. These annotations were carried out for every video frame (20 fps). As mentioned above, speaking predictions must be made for every second.

Since the classes are severely imbalanced, we will be using the Area Under the ROC Curve (ROC-AUC) as the evaluation metric. Thus, participants should submit non-binary prediction scores (posterior probabilities, distances to the separating hyperplane, etc.).
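For reference, ROC-AUC can be computed from non-binary scores with the rank-sum (Mann–Whitney U) formulation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A self-contained sketch (equivalent to library implementations such as scikit-learn's `roc_auc_score`):

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum formulation.

    labels: 0/1 ground truth; scores: real-valued predictions.
    Tied scores receive their average rank, so ties contribute 0.5.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Because the metric depends only on the ranking of the scores, any monotonic score (posterior probability, margin, distance to the hyperplane) yields the same ROC-AUC.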

The task will be evaluated on the held-out test set described above, whose subjects are not present in the training set.

For evaluation, we will ask the teams to provide estimations for the two subtasks stated above (unimodal and multimodal).

References

[1] McNeill, D.: Language and gesture, vol. 2. Cambridge University Press (2000)

[2] Hung, H., Englebienne, G., Kools, J.: Classifying social actions with a single accelerometer. In: Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pp. 207–210. ACM (2013)

[3] Gedik, E. and Hung, H., Personalised models for speech detection from body movements using transductive parameter transfer, Journal of Personal and Ubiquitous Computing, (2017)

[4] Hung, H. and Ba, S. O., Speech/non-speech Detection in Meetings from Automatically Extracted Low Resolution Visual Features, Idiap Research Report, (2010)

[5] Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., and Murino, V., Look at who's talking: Voice activity detection by automated gesture analysis, In the workshop on Interactive Human Behavior Analysis in Open or Public Spaces, International Joint Conference on Ambient Intelligence, (2011).

[6] Cabrera-Quiros, L., Demetriou, A., Gedik, E., van der Meij, L., & Hung, H. (2018). The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing.

[7] Wolff, H.-G. and Moser, K., Effects of networking on career success: a longitudinal study. Journal of Applied Psychology, 94(1):196, (2009).

Task Organizers

Task Schedule

Workshop will be held online.