Computers & Graphics

Volume 99, October 2021, Pages 201-211

Special Section on 3DOR 2021
SHREC 2021: Skeleton-based hand gesture recognition in the wild

https://doi.org/10.1016/j.cag.2021.07.007

Highlights

  • 3D Shape Retrieval Challenge 2021 at 3DOR’21 Track on Skeleton-based Hand Gesture Recognition in the Wild.
  • New gesture dataset with 180 gesture sequences and an 18-class gesture dictionary.
  • Contest with 4 groups presenting their gesture recognition methods.
  • Report of the results and performance of all the methods.

Abstract

Gesture recognition is a fundamental tool to enable novel interaction paradigms in a variety of application scenarios like Mixed Reality environments, touchless public kiosks, entertainment systems, and more. Recognition of hand gestures can nowadays be performed directly on the stream of hand skeletons estimated by the software provided with low-cost trackers (Ultraleap) and MR headsets (Hololens, Oculus Quest) or by video processing software modules (e.g. Google Mediapipe). Despite the recent advancements in gesture and action recognition from skeletons, it is unclear how well the current state-of-the-art techniques can perform in a real-world scenario for the recognition of a wide set of heterogeneous gestures, as many benchmarks do not test online recognition and use limited dictionaries. This motivated the proposal of the SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild. For this contest, we created a novel dataset of heterogeneous gestures featuring different types and durations. These gestures have to be found inside sequences in an online recognition scenario. This paper presents the results of the contest, showing the performance of the techniques proposed by four research groups on this challenging task, compared with a simple baseline method.

Introduction

The recognition of gestures based on hand skeleton tracking is becoming the default interaction method for the new generation of Virtual Reality (VR) and Mixed Reality (MR) devices like Oculus Quest and Microsoft Hololens, which implement specific, advanced solutions [5], [16]. Low-cost hand tracking devices with good performance have been available since 2010 [19] and are used in several application domains and research works. Real-time hand pose tracking is now possible from single-camera input using Google tools [22]. It is, therefore, extremely likely that most future hand gesture recognition tools will work directly on hand skeleton poses rather than on RGB or depth images. These facts strongly motivate research efforts aimed at the development of such tools. In practical application scenarios, these gesture recognizers need to work in real time and be able to detect and correctly label gestures "in the wild" within a continuous sequence of hand movements.
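As an illustration of how such skeleton streams can be obtained in practice, the sketch below extracts a single-frame hand skeleton with the MediaPipe Hands solution; it is a minimal, illustrative example (the input file name and single-image mode are assumptions), not the acquisition pipeline used for the contest dataset.

    # Minimal sketch: a 21-joint hand skeleton from one RGB frame with MediaPipe Hands.
    # Illustrative only; not the pipeline used to build the contest dataset.
    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
    image = cv2.imread("frame.png")  # hypothetical input frame
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks:
        # 21 (x, y, z) joints in normalized image coordinates; stacking these
        # per-frame skeletons over time yields the sequences a recognizer labels.
        skeleton = [(lm.x, lm.y, lm.z) for lm in results.multi_hand_landmarks[0].landmark]
        print(len(skeleton))  # 21
    hands.close()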
Several methods have been recently proposed in the literature for the skeleton-based gesture recognition task. However, as pointed out in [1], the currently available benchmarks that focus on online recognition scenarios are limited. Many of them do not test recognizers in an online setting, or they evaluate methods on limited vocabularies that do not include many gesture types. Hand gestures, in fact, can be classified into different types according to their distinctive features. Some gestures are static, characterized by keeping a fixed hand pose for a minimum amount of time. Others are dynamic and characterized by a single trajectory, with a hand pose that either does not change or is not semantically relevant. Others are dynamic and characterized not only by a global motion, but also by the evolution of the fingers' articulation over time.
Previous contests organized on skeleton-based gesture recognition were limited to offline recognition (SHREC’17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset [3]) or featured a very limited dictionary of gestures (SHREC 2019 track on online gesture detection [2]).
For this reason, we created a novel dataset including 18 gesture classes belonging to different types. A subset of 7 classes is static, characterized by a hand pose kept fixed for at least one second (One, Two, Three, Four, OK, Menu, Pointing). The remaining ones are dynamic: 5 are coarse, characterized by a single global trajectory of the hand (Left, Right, Circle, V, Cross), and 6 are fine, characterized by variations in the fingers' articulation (Grab, Pinch, Tap, Deny, Knob, Expand). Fig. 1 shows the gestures' templates. A peculiarity of the collected data is that the gestures are executed within long sequences of hand gesticulation, as they were captured during generic user interaction.
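For reference, the dictionary and its grouping by type can be summarized as follows; the class labels are exactly those listed above, while the grouping keys are only illustrative names.

    # The 18-class gesture dictionary grouped by type, as described in the text.
    GESTURE_TYPES = {
        "static": ["One", "Two", "Three", "Four", "OK", "Menu", "Pointing"],
        "dynamic_coarse": ["Left", "Right", "Circle", "V", "Cross"],
        "dynamic_fine": ["Grab", "Pinch", "Tap", "Deny", "Knob", "Expand"],
    }
    assert sum(len(v) for v in GESTURE_TYPES.values()) == 18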
Given the dataset, we proposed an online recognition task within the Eurographics SHREC 2021 framework. This paper reports the outcomes of the contest. The paper is organized as follows: Section 2 presents the novel dataset, Section 3 the proposed task and the evaluation method, and Section 4 presents the participating groups and the methods they proposed, together with a baseline method.

Section snippets

Dataset creation

The dataset created for the contest is a collection of 180 gesture sequences. Each sequence, captured using a Leap Motion device, features either 3, 4, or 5 gestures interleaved with non-significant gesticulation. The dictionary used consists of 18 classes of gestures, each appearing in the dataset an equal number of times (40 occurrences per class, i.e., 720 gesture instances overall, or four per sequence on average).
Gestures were performed by five different subjects in pre-determined sequences. The execution of the dictionary gestures followed specific templates

Task proposed and evaluation

The goal of the participants was to correctly detect the gestures included in the sequences with an online detection approach. The captured gesture database was split, as described, into a training set with associated annotations of gesture timestamps and labels, which could be used to train the detection algorithms, and a test set with no annotations available. Participants had to provide a list of the gestures detected in the test set with associated labels and start and end timestamps.
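The snippet above specifies the output format (label, start timestamp, end timestamp) but not the scoring rule; the sketch below is therefore only a plausible illustration of such an evaluation, matching each prediction to a ground-truth gesture with the same label and an overlapping time interval, and is not the contest's actual metric.

    # Hedged sketch: match predicted (label, start, end) detections to ground-truth
    # annotations by label and temporal overlap. Not the official contest metric.
    def overlaps(a_start, a_end, b_start, b_end):
        return max(a_start, b_start) < min(a_end, b_end)

    def evaluate(predictions, ground_truth):
        matched, correct = set(), 0
        for label, start, end in predictions:
            for i, (gt_label, gt_start, gt_end) in enumerate(ground_truth):
                if i not in matched and label == gt_label and overlaps(start, end, gt_start, gt_end):
                    matched.add(i)
                    correct += 1
                    break
        detection_rate = correct / len(ground_truth) if ground_truth else 0.0
        false_positives = len(predictions) - correct
        return detection_rate, false_positives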

Participants and methods

Five research groups registered for the contest and sent results, but one withdrew after the evaluation. Each group sent up to three annotation files obtained with different methods or parameter settings. The methods are described in the following subsection, together with the simple technique that we used as a baseline.

Evaluation results

A summary of the results for each group, averaged over all the gestures, is presented in Table 1. The table also reports the execution time of the methods in terms of Total Time and Classification Time. The first is the time each method takes to compute results for the entire test set (i.e., all the sequences); the second is the average time needed, for each method, to perform a single gesture classification. Times show that all the methods are
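To make the two timing figures concrete, the sketch below shows how they relate for a hypothetical recognizer exposing windows() and classify() calls; it is not the harness used by the organizers.

    # Illustrative relation between Total Time (whole test set) and Classification
    # Time (average per classification call) for a hypothetical recognizer.
    import time

    def measure(recognizer, test_sequences):
        calls, per_call_time = 0, 0.0
        start_total = time.perf_counter()
        for seq in test_sequences:
            for window in recognizer.windows(seq):   # hypothetical sliding windows
                t0 = time.perf_counter()
                recognizer.classify(window)          # hypothetical classification call
                per_call_time += time.perf_counter() - t0
                calls += 1
        total_time = time.perf_counter() - start_total
        classification_time = per_call_time / max(calls, 1)
        return total_time, classification_time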

Discussion

The evaluation outcomes provide useful insights for the design of online gesture recognizers usable "in the wild". The techniques tested achieve promising scores given the limited number of annotated sequences available for training. Given the short amount of time available for the contest, this is a good result.
A nice aspect of the submissions received is that the proposed methods are quite different from each other and are exemplars of the principal network-based approaches proposed in the literature for

Conclusions

The development of effective and flexible gesture recognizers able to detect and correctly classify hand gestures of different kinds is fundamental not only to enable advanced user interfaces for Virtual and Mixed Reality applications, but also, for example, to enable touchless interfaces such as public kiosks, which are expected to replace touch-based ones in the wake of the pandemic, being a more hygienic and safer solution. It is, therefore, important to support

CRediT authorship contribution statement

Ariel Caputo: Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Andrea Giachetti: Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Simone Soso: Methodology, Software. Deborah Pintani: Methodology, Software. Andrea D’Eusanio: Methodology, Software, Writing – original draft. Stefano Pini: Methodology, Software, Writing – original draft. Guido Borghi: Methodology, Software, Writing –

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (23)

  • A. Caputo et al. SFINGE 3D: a novel benchmark for online detection and recognition of heterogeneous hand gestures from 3D fingers' trajectories. Computers & Graphics (2020).
  • F.M. Caputo et al. Online gesture recognition. Eurographics Workshop on 3D Object Retrieval (2019).
  • Q. De Smedt et al. SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset. 3DOR - 10th Eurographics Workshop on 3D Object Retrieval (2017).
  • L. Fang et al. Deep learning-based point-scanning super-resolution imaging. bioRxiv (2019).
  • S. Han et al. MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (TOG) (2020).
  • K. Hara et al. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?
  • J. Howard et al. fastai: a layered API for deep learning. Information (2020).
  • S. Ioffe et al. Batch normalization: accelerating deep network training by reducing internal covariate shift (2015).
  • D.P. Kingma et al. Adam: a method for stochastic optimization.
  • M.C. Leong et al. Semi-CNN architecture for effective spatio-temporal learning in action recognition. Applied Sciences (2020).
  • J. Lin et al. TSM: temporal shift module for efficient video understanding. Proceedings of the IEEE International Conference on Computer Vision (2019).