Academic paper

Interaction Compass: Multi-Label Zero-Shot Learning of Human-Object Interactions via Spatial Relations
Document Type
Conference
Source
2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8452-8463, Oct. 2021
Subject
Computing and Processing
Training
Visualization
Computer vision
Image recognition
Annotations
Computational modeling
Genomics
Transfer/Low-shot/Semi/Unsupervised Learning
Action and behavior recognition
Recognition and classification
Language
English
ISSN
2380-7504
Abstract
We study the problem of multi-label zero-shot recognition in which labels are in the form of human-object interactions (combinations of actions on objects), each image may contain multiple interactions, and some interactions have no training images. We propose a novel compositional learning framework that decouples interaction labels into separate action and object scores that incorporate the spatial compatibility between the two components. We combine these scores to efficiently recognize seen and unseen interactions. However, learning action-object spatial relations, in principle, requires bounding-box annotations, which are costly to gather. Moreover, it is not clear how to generalize spatial relations to unseen interactions. We address these challenges by developing a cross-attention mechanism that localizes objects from action locations and vice versa by predicting displacements between them, referred to as relational directions. During training, we estimate the relational directions as those that maximize the scores of ground-truth interactions, guiding predictions toward compatible action-object regions. Through extensive experiments, we show the effectiveness of our framework, improving the state of the art by 2.6% mAP on HICO and 5.8% recall on Visual Genome.
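To make the compositional scoring idea in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how decoupled action and object scores can be recomposed into interaction scores, with a spatial-compatibility term standing in for the relational-direction mechanism. All function names, tensor shapes, and the additive combination rule are illustrative assumptions.

```python
import torch

def interaction_scores(action_scores, object_scores, spatial_compat):
    """
    action_scores:  [B, A]     per-image score for each action
    object_scores:  [B, O]     per-image score for each object
    spatial_compat: [B, A, O]  compatibility of the predicted action->object displacement
    returns:        [B, A, O]  scores for every action-object interaction,
                               including combinations unseen during training
    """
    # Broadcast the decoupled scores over all action-object pairs and add the
    # spatial term; unseen interactions are scored by recomposing seen components.
    return action_scores.unsqueeze(2) + object_scores.unsqueeze(1) + spatial_compat

if __name__ == "__main__":
    B, A, O = 2, 117, 80  # HICO has 117 actions and 80 object categories
    a = torch.randn(B, A)
    o = torch.randn(B, O)
    s = torch.randn(B, A, O)
    print(interaction_scores(a, o, s).shape)  # torch.Size([2, 117, 80])
```

Because the interaction score factors into action, object, and spatial terms, a combination never observed at training time can still be scored as long as its action and object components were each seen in other interactions.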