Computers & Graphics

Volume 99, October 2021, Pages 201-211

Special Section on 3DOR 2021
SHREC 2021: Skeleton-based hand gesture recognition in the wild

https://doi.org/10.1016/j.cag.2021.07.007

Highlights

  • 3D Shape Retrieval Challenge 2021 at 3DOR’21 Track on Skeleton-based Hand Gesture Recognition in the Wild.
  • New gesture dataset with 180 gesture sequences and an 18-class gesture dictionary.
  • Contest with 4 groups presenting their gesture recognition methods.
  • Report of the results and performance of all the methods.

Abstract

Gesture recognition is a fundamental tool to enable novel interaction paradigms in a variety of application scenarios like Mixed Reality environments, touchless public kiosks, entertainment systems, and more. Recognition of hand gestures can nowadays be performed directly on the stream of hand skeletons estimated by the software provided with low-cost trackers (Ultraleap) and MR headsets (Hololens, Oculus Quest) or by video processing software modules (e.g. Google Mediapipe). Despite the recent advancements in gesture and action recognition from skeletons, it is unclear how well the current state-of-the-art techniques can perform in a real-world scenario for the recognition of a wide set of heterogeneous gestures, as many benchmarks do not test online recognition and use limited dictionaries. This motivated the proposal of the SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild. For this contest, we created a novel dataset of heterogeneous gestures featuring different types and durations. These gestures have to be found inside sequences in an online recognition scenario. This paper presents the results of the contest, showing the performance of the techniques proposed by four research groups on this challenging task, compared with a simple baseline method.

Introduction

The recognition of gestures based on hand skeleton tracking is becoming the default interaction method for the new generation of Virtual Reality (VR) and Mixed Reality (MR) devices like Oculus Quest and Microsoft Hololens, which implement specific, advanced solutions [5], [16]. Low-cost hand tracking devices with good performance have been available since 2010 [19] and are used in several application domains and research works. Real-time hand pose tracking is now possible from single-camera input using Google tools [22]. It is, therefore, extremely likely that most future hand gesture recognition tools will work directly on hand skeleton poses rather than on RGB or depth images. These facts strongly motivate research efforts aimed at the development of such tools. In practical application scenarios, these gesture recognizers need to work in real time and be able to detect and correctly label gestures "in the wild" within a continuous sequence of hand movements.
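As an illustration of how such skeleton streams can be obtained in practice, the sketch below extracts a single-frame hand skeleton with the MediaPipe Hands solution; it is a minimal, illustrative example (the input file name and single-image mode are assumptions), not the acquisition pipeline used for the contest dataset.

    # Minimal sketch: a 21-joint hand skeleton from one RGB frame with MediaPipe Hands.
    # Illustrative only; not the pipeline used to build the contest dataset.
    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
    image = cv2.imread("frame.png")  # hypothetical input frame
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks:
        # 21 (x, y, z) joints in normalized image coordinates; stacking these
        # per-frame skeletons over time yields the sequences a recognizer labels.
        skeleton = [(lm.x, lm.y, lm.z) for lm in results.multi_hand_landmarks[0].landmark]
        print(len(skeleton))  # 21
    hands.close()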
Several methods have been recently proposed in the literature for the skeleton-based gesture recognition task. However, as pointed out in [1], the currently available benchmarks that focus on online recognition scenarios are limited. Many of them do not test recognizers in an online setting, or they evaluate methods on limited vocabularies that do not include many gesture types. Hand gestures, in fact, can be classified into different types according to their distinctive features. Some gestures are static, characterized by keeping a fixed hand pose for a minimum amount of time. Others are dynamic and characterized by a single trajectory, with a hand pose that either does not change or is not semantically relevant. Others are dynamic and characterized not only by a global motion, but also by the evolution of the fingers' articulation over time.
Previous contests organized on skeleton-based gesture recognition were limited to offline recognition (SHREC’17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset [3]) or featured a very limited dictionary of gestures (SHREC 2019 track on online gesture detection [2]).
For this reason, we created a novel dataset including 18 gesture classes belonging to different types. A subset of 7 classes is static, characterized by a hand pose kept fixed for at least one second (One, Two, Three, Four, OK, Menu, Pointing). The remaining ones are dynamic: 5 are coarse, characterized by a single global trajectory of the hand (Left, Right, Circle, V, Cross), and 6 are fine, characterized by variations in the fingers' articulation (Grab, Pinch, Tap, Deny, Knob, Expand). Fig. 1 shows the gestures' templates. A peculiarity of the collected data is that the gestures are executed within long sequences of hand gesticulation, as they were captured during generic user interaction.
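For reference, the dictionary and its grouping by type can be summarized as follows; the class labels are exactly those listed above, while the grouping keys are only illustrative names.

    # The 18-class gesture dictionary grouped by type, as described in the text.
    GESTURE_TYPES = {
        "static": ["One", "Two", "Three", "Four", "OK", "Menu", "Pointing"],
        "dynamic_coarse": ["Left", "Right", "Circle", "V", "Cross"],
        "dynamic_fine": ["Grab", "Pinch", "Tap", "Deny", "Knob", "Expand"],
    }
    assert sum(len(v) for v in GESTURE_TYPES.values()) == 18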
Given the dataset, we proposed an online recognition task within the Eurographics SHREC 2021 framework. This paper reports the outcomes of the contest. The paper is organized as follows: Section 2 presents the novel dataset, Section 3 the proposed task and the evaluation method, and Section 4 presents the participating groups and the methods they proposed, together with a baseline method.

Section snippets

Dataset creation

The dataset created for the contest is a collection of 180 gesture sequences. Each sequence, captured using a Leap Motion device, features either 3, 4, or 5 gestures interleaved with non-significant gesticulation. The dictionary used consists of 18 classes of gestures, each appearing in the dataset an equal number of times (40 occurrences per class, i.e., 720 gesture instances overall, or four per sequence on average).
Gestures were performed by five different subjects in pre-determined sequences. The execution of the dictionary gestures followed specific templates

Task proposed and evaluation

The goal of the participants was to correctly detect the gestures included in the sequences with an online detection approach. The captured gesture database was split, as described, into a training set with associated annotations of gesture timestamps and labels, which could be used to train the detection algorithms, and a test set with no annotations available. Participants had to provide a list of the gestures detected in the test set with associated labels and start and end timestamps.
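The snippet above specifies the output format (label, start timestamp, end timestamp) but not the scoring rule; the sketch below is therefore only a plausible illustration of such an evaluation, matching each prediction to a ground-truth gesture with the same label and an overlapping time interval, and is not the contest's actual metric.

    # Hedged sketch: match predicted (label, start, end) detections to ground-truth
    # annotations by label and temporal overlap. Not the official contest metric.
    def overlaps(a_start, a_end, b_start, b_end):
        return max(a_start, b_start) < min(a_end, b_end)

    def evaluate(predictions, ground_truth):
        matched, correct = set(), 0
        for label, start, end in predictions:
            for i, (gt_label, gt_start, gt_end) in enumerate(ground_truth):
                if i not in matched and label == gt_label and overlaps(start, end, gt_start, gt_end):
                    matched.add(i)
                    correct += 1
                    break
        detection_rate = correct / len(ground_truth) if ground_truth else 0.0
        false_positives = len(predictions) - correct
        return detection_rate, false_positives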

Participants and methods

Five research groups registered for the contest and sent results, but one withdrew after the evaluation. Each group sent up to three annotation files obtained with different methods or parameter settings. The methods are described in the following subsection, together with the simple technique that we used as a baseline.

Evaluation results

A summary of the results for each group, averaged over all the gestures, is presented in Table 1. The table also reports the execution time of the methods in terms of Total Time and Classification Time. The first is the time each method takes to compute results for the entire test set (i.e., all the sequences); the second is the average time needed, for each method, to perform a single gesture classification. Times show that all the methods are
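To make the two timing figures concrete, the sketch below shows how they relate for a hypothetical recognizer exposing windows() and classify() calls; it is not the harness used by the organizers.

    # Illustrative relation between Total Time (whole test set) and Classification
    # Time (average per classification call) for a hypothetical recognizer.
    import time

    def measure(recognizer, test_sequences):
        calls, per_call_time = 0, 0.0
        start_total = time.perf_counter()
        for seq in test_sequences:
            for window in recognizer.windows(seq):   # hypothetical sliding windows
                t0 = time.perf_counter()
                recognizer.classify(window)          # hypothetical classification call
                per_call_time += time.perf_counter() - t0
                calls += 1
        total_time = time.perf_counter() - start_total
        classification_time = per_call_time / max(calls, 1)
        return total_time, classification_time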

Discussion

The evaluation outcomes provide useful insights for the design of online gesture recognizers usable "in the wild". The techniques tested achieve promising scores given the limited number of annotated sequences available for training. Given the short amount of time available for the contest, this is a good result.
A nice aspect of the submissions received is that the proposed methods are quite different from each other and are exemplars of the principal network-based approaches proposed in the literature for

Conclusions

The development of effective and flexible gesture recognizers able to detect and correctly classify hand gestures of different kinds is fundamental not only to enable advanced user interfaces for Virtual and Mixed Reality applications, but also, for example, to enable touchless interfaces such as public kiosks, which are expected to replace touch-based ones in the wake of the pandemic, being a more hygienic and safer solution. It is, therefore, important to support

CRediT authorship contribution statement

Ariel Caputo: Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Andrea Giachetti: Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Simone Soso: Methodology, Software. Deborah Pintani: Methodology, Software. Andrea D’Eusanio: Methodology, Software, Writing – original draft. Stefano Pini: Methodology, Software, Writing – original draft. Guido Borghi: Methodology, Software, Writing –

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (23)

  • A. Caputo et al. SFINGE 3D: a novel benchmark for online detection and recognition of heterogeneous hand gestures from 3D fingers' trajectories. Computers & Graphics (2020).
  • F.M. Caputo et al. Online gesture recognition. Eurographics Workshop on 3D Object Retrieval (2019).
  • Q. De Smedt et al. SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset. 3DOR - 10th Eurographics Workshop on 3D Object Retrieval (2017).
  • L. Fang et al. Deep learning-based point-scanning super-resolution imaging. bioRxiv (2019).
  • S. Han et al. MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (TOG) (2020).
  • K. Hara et al. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?
  • J. Howard et al. fastai: a layered API for deep learning. Information (2020).
  • S. Ioffe et al. Batch normalization: accelerating deep network training by reducing internal covariate shift (2015).
  • D.P. Kingma et al. Adam: a method for stochastic optimization.
  • M.C. Leong et al. Semi-CNN architecture for effective spatio-temporal learning in action recognition. Applied Sciences (2020).
  • J. Lin et al. TSM: temporal shift module for efficient video understanding. Proceedings of the IEEE International Conference on Computer Vision (2019).