See the MediaEval 2023 webpage for information on how to register and participate.
Position and action detection and classification are among the main challenges in visual content analysis and mining. Sport video analysis has been a very popular research topic, owing to the variety of application areas, ranging from analysis of athletes' performances and rehabilitation to multimedia intelligent devices with user-tailored digests. This year we propose a series of 6 tasks, each divided into 2 subtasks covering two sports, table tennis and swimming. These tasks are a follow-up to the 2022 Sport Task and SwimTrack.
Task 1 - athlete position detection
Subtask 1.1 (table tennis) - to detect the 2 or 4 players (depending on whether the match is singles or doubles) and track them throughout the video, in particular during doubles matches where players overlap frequently, from videos recorded from various angles (e.g., side, corner).
Subtask 1.2 (swimming) - to detect up to 8 swimmers in the pool from static videos recorded from the side of the pool.
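To make the expected output of Task 1 concrete, here is a minimal sketch of per-frame person detection followed by naive IoU-based linking across frames. The detector choice (a pretrained torchvision Faster R-CNN), the score and IoU thresholds, and the absence of any re-identification are illustrative assumptions, not an official baseline or the task's data format.

```python
# Illustrative sketch only: per-frame person detection + greedy IoU linking.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_players(frame_rgb, score_thr=0.7):
    """Return [x1, y1, x2, y2] boxes for the COCO 'person' class (label 1)."""
    with torch.no_grad():
        out = detector([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thr)
    return out["boxes"][keep].tolist()

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def track(frames, iou_thr=0.3):
    """Greedily link detections to the previous frame's boxes by IoU."""
    tracks, next_id, history = {}, 0, []
    for frame in frames:
        assigned = {}
        for box in detect_players(frame):
            best_id, best_iou = None, iou_thr
            for tid, prev in tracks.items():
                if tid not in assigned and iou(box, prev) > best_iou:
                    best_id, best_iou = tid, iou(box, prev)
            if best_id is None:
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = box
        tracks = assigned
        history.append(assigned)
    return history  # one {track_id: box} dict per frame
```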
Task 2 - stroke detection
Subtask 2.1 (table tennis) - to detect when a player is performing a stroke (i.e., a ball hit with the racket) from close-up videos.
Subtask 2.2 (swimming) - to detect each time a swimmer completes a stroke cycle, for each swimming style (for freestyle, backstroke, and butterfly, each time the swimmer's right hand enters the water; for breaststroke, each time the head is at its highest point).
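One minimal way to think about Task 2 is event detection on a 1-D per-frame signal: assuming some per-frame stroke or motion score produced by any model, events can be reported at its peaks. The score curve below is synthetic and the thresholds are arbitrary illustrative values, not part of the task definition.

```python
# Illustrative sketch only: temporal event detection from a per-frame score.
import numpy as np
from scipy.signal import find_peaks

def detect_events(frame_scores, fps, min_gap_s=0.5, min_height=0.5):
    """Return event timestamps (seconds) at local maxima of the score."""
    scores = np.asarray(frame_scores, dtype=float)
    peaks, _ = find_peaks(scores,
                          height=min_height,              # minimum peak score
                          distance=int(min_gap_s * fps))  # minimum spacing in frames
    return peaks / fps

# Example with a synthetic periodic signal (stand-in for a real score curve).
t = np.arange(0, 10, 1 / 25)                            # 10 s at 25 fps
fake_scores = 0.5 + 0.5 * np.sin(2 * np.pi * 1.2 * t)   # ~1.2 cycles per second
print(detect_events(fake_scores, fps=25))
```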
Task 3 - motion classification
Subtask 3.1 (table tennis) - to classify strokes from trimmed videos in which only one stroke is present. There are 3 categories of strokes: serves, forehands, and backhands, with 6 serve classes and 5 classes each for forehands and backhands, giving a total of 16 stroke classes plus one non-stroke class.
Subtask 3.2 (swimming) - to classify the swimming style (Freestyle, Backstroke, Breaststroke, Butterfly).
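For the classification subtasks, a simple starting point is a 3D-CNN clip classifier with one output per class (16 stroke classes plus non-stroke for Subtask 3.1, or 4 style classes for Subtask 3.2). The backbone, clip shape, and random input below are illustrative assumptions, not the task's official baseline.

```python
# Illustrative sketch only: a 3D-CNN clip classifier.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 17                      # 16 stroke classes + 1 non-stroke (Subtask 3.1)

model = r3d_18(weights=None)          # optionally load Kinetics-pretrained weights
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# A trimmed clip as a tensor: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
logits = model(clip)
print(logits.argmax(dim=1))           # predicted class index for the clip
```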
Task 4 - field/table registration
Subtask 4.1 (table tennis) - to detect the table position for a given video frame.
Subtask 4.2 (swimming) - to detect the pool position for a given video frame.
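Registration in Task 4 amounts to estimating a homography between points in the image (e.g., table or pool corners) and their coordinates in a metric template. In the sketch below, the corner pixel values are made up for illustration; only the 2.74 m x 1.525 m table dimensions are real.

```python
# Illustrative sketch only: image-to-template homography estimation.
import numpy as np
import cv2

# Four table corners as seen in the image (pixels), hypothetical values.
image_pts = np.array([[412, 310], [870, 305], [955, 520], [330, 530]], dtype=np.float32)

# Corresponding corners in a top-down template of a 2.74 m x 1.525 m table (metres).
world_pts = np.array([[0, 0], [2.74, 0], [2.74, 1.525], [0, 1.525]], dtype=np.float32)

# With only 4 exact correspondences a direct fit suffices; RANSAC helps
# when more, noisier point correspondences are available.
H, _ = cv2.findHomography(image_pts, world_pts)

def to_table_coords(x, y):
    """Map an image pixel into table coordinates with the estimated homography."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

print(to_table_coords(650, 420))
```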
Task 5 - sound detection
Subtask 5.1 (table tennis) - to detect when the ball hits the table or the racket.
Subtask 5.2 (swimming) - to detect the buzzer sound. In swimming races, the start is signalled by a buzzer that informs swimmers they can start.
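Task 5 can be approached as audio onset detection: ball/racket impacts and the start buzzer both produce sharp changes in the soundtrack. The sketch below uses off-the-shelf onset detection from librosa; distinguishing bounce, racket hit, and buzzer would still require a classification step, and the audio file name is hypothetical.

```python
# Illustrative sketch only: locate sharp audio events as onsets.
import librosa

def detect_audio_events(audio_path):
    """Return onset times (seconds) detected in the audio file."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    return librosa.onset.onset_detect(y=y, sr=sr, units="time")

# print(detect_audio_events("match_audio.wav"))  # hypothetical file name
```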
Task 6 - score and results extraction
Subtask 6.1 (table tennis) - to recognise the score of the match. In table tennis, the score can be embedded in the broadcast video or shown by referees on scoreboards. When the score is embedded in the video stream, the players' names are also displayed.
Subtask 6.2 (swimming) - to recognise race results. During swimming competitions, the results of each race are displayed on digital boards; the goal is to recognise the characters on these boards to extract the results.
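Task 6 is essentially text recognition on scoreboard or results-board regions. A minimal sketch with off-the-shelf OCR follows; the crop coordinates and file names are hypothetical, and a real submission would first localise the board and likely need a more robust recogniser.

```python
# Illustrative sketch only: OCR on a cropped board region of a frame.
import cv2
import pytesseract

def read_board(frame_bgr, box):
    """OCR a (x1, y1, x2, y2) region of a video frame and return its text."""
    x1, y1, x2, y2 = box
    crop = frame_bgr[y1:y2, x1:x2]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return pytesseract.image_to_string(gray)

# frame = cv2.imread("frame_000123.png")           # hypothetical frame
# print(read_board(frame, (50, 600, 500, 700)))    # hypothetical board region
```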
The task is of interest to researchers in the areas of machine learning (classification), visual content analysis, computer vision, and sport performance. We explicitly encourage participation from researchers working on computer-aided analysis of sport performance.
Our focus is on recordings made both with widespread, inexpensive video cameras (e.g., GoPro) and with high-quality cameras (e.g., Blackmagic 4K).
Each video has been manually annotated by experts. For event-based annotations, we annotated the moments in the video that are relevant to the event. For positions, we annotated key and intermediate positions of the athletes and relied on interpolation for the remaining frames.
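As an illustration of the interpolation mentioned above, positions between annotated keyframes can be filled in linearly; the keyframe indices and coordinates below are made-up values, not taken from the dataset.

```python
# Illustrative sketch only: linear interpolation between annotated keyframes.
import numpy as np

key_frames = np.array([0, 12, 30])          # annotated frame indices (hypothetical)
key_x = np.array([420.0, 455.0, 512.0])     # annotated x positions in pixels
key_y = np.array([300.0, 298.0, 310.0])     # annotated y positions in pixels

all_frames = np.arange(key_frames[0], key_frames[-1] + 1)
x = np.interp(all_frames, key_frames, key_x)   # interpolated x for every frame
y = np.interp(all_frames, key_frames, key_y)   # interpolated y for every frame

print(list(zip(all_frames[:5], x[:5], y[:5])))
```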
Each task has its own evaluation methodology, which will be provided once the dataset is released.
Please contact the task organizers by email if you have questions (see below).
Contact: romain.vuillemot@ec-lyon.fr