Academic Article

Multiview Video-Based 3-D Hand Pose Estimation
Document Type
Periodical
Source
IEEE Transactions on Artificial Intelligence, 4(4):896-909, Aug. 2023
Subject
Computing and Processing
Cameras
Three-dimensional displays
Videos
Pose estimation
Solid modeling
Artificial intelligence
Video on demand
Dataset
hand pose estimation (HPE)
multiview
video
Language
English
ISSN
2691-4581
Abstract
Hand pose estimation (HPE) can be used for a variety of human–computer interaction applications, such as gesture-based control for physical or virtual/augmented reality devices. Recent works have shown that videos or multiview images carry rich information regarding the hand, allowing for the development of more robust HPE systems. In this article, we present the multiview video-based three-dimensional (3-D) hand (MuViHand) dataset, consisting of multiview videos of the hand along with ground-truth 3-D pose labels. Our dataset includes more than 402 000 synthetic hand images available in 4560 videos. The videos have been simultaneously captured from six different angles with complex backgrounds and random levels of dynamic lighting. The data has been captured from ten distinct animated subjects using 12 cameras in a semicircle topology where six tracking cameras only focus on the hand and the other six fixed cameras capture the entire body. Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, and graph networks with U-Net architectures to estimate the final 3-D pose information. We perform extensive experiments and show the challenging nature of this new dataset as well as the effectiveness of our proposed method. Ablation studies show the added value of each component in MuViHandNet, as well as the benefit of having temporal and sequential information in the dataset.
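The abstract outlines a three-stage pipeline: per-frame image encoders, recurrent learners that aggregate first over time and then over camera angles, and a decoder that outputs the 3-D joint positions. The following is a minimal, purely illustrative sketch of that data flow in plain Python. All names, dimensions, and the toy "encoder"/"decoder" arithmetic are placeholder assumptions, not the authors' MuViHandNet implementation (which uses CNN encoders, recurrent networks, and graph U-Nets).

```python
# Hypothetical structural sketch of the pipeline stages described in the
# abstract: encode each frame, aggregate temporally per view, aggregate
# across views (angularly), then decode a 3-D hand pose. Toy math only.

NUM_JOINTS = 21   # common hand-skeleton convention (assumption)
EMBED_DIM = 8     # toy embedding size (assumption)

def encode_frame(frame):
    """Stand-in for a CNN image encoder: map a frame (flat list of pixel
    intensities) to a fixed-size visual embedding."""
    mean = sum(frame) / len(frame)
    return [mean * (i + 1) / EMBED_DIM for i in range(EMBED_DIM)]

def recurrent_aggregate(embeddings, alpha=0.5):
    """Stand-in for a recurrent learner: fold a sequence of embeddings
    (over time, or over camera angles) into one hidden state using an
    exponential moving average."""
    state = [0.0] * EMBED_DIM
    for emb in embeddings:
        state = [alpha * s + (1 - alpha) * e for s, e in zip(state, emb)]
    return state

def decode_pose(state):
    """Stand-in for the graph/U-Net pose decoder: produce NUM_JOINTS
    (x, y, z) coordinates from the aggregated state."""
    return [(state[j % EMBED_DIM], j * 0.1, -j * 0.1)
            for j in range(NUM_JOINTS)]

def pipeline_sketch(multiview_video):
    """multiview_video: list over views, each a list over time of frames."""
    per_view = [recurrent_aggregate([encode_frame(f) for f in frames])
                for frames in multiview_video]   # temporal aggregation
    fused = recurrent_aggregate(per_view)        # angular aggregation
    return decode_pose(fused)

# Toy run: 6 camera views, 4 frames each, 16 "pixels" per frame.
video = [[[v + t + p * 0.01 for p in range(16)] for t in range(4)]
         for v in range(6)]
pose = pipeline_sketch(video)
```

The two-level aggregation mirrors the abstract's claim that the recurrent learners capture both temporal and angular sequential information before the final pose estimate.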