Academic Paper

Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection
Document Type
Conference
Source
2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15891-15900, Oct. 2021
Subject
Computing and Processing
Measurement
Visualization
Grounding
Triples (Data structure)
Image edge detection
Predictive models
Rendering (computer graphics)
Scene analysis and understanding
Action and behavior recognition
Vision + language
Visual reasoning and logical representation
Language
English
ISSN
2380-7504
Abstract
Scene Graph Generators (SGGs) are models that, given an image, build a directed graph where each edge represents a predicted subject-predicate-object triplet. Most SGGs silently exploit datasets' bias on a relationship's context, i.e. its subject and object, to improve recall while neglecting spatial and visual evidence; e.g., having seen a glut of data for "person wearing shirt", they are overconfident that every person is wearing every shirt. Such imprecise predictions are mainly ascribed to the lack of negative examples for most relationships, which obstructs models from meaningfully learning predicates, even those that have ample positive examples. We first present an in-depth investigation of the context bias issue to showcase that all examined state-of-the-art SGGs share the above vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image in a closed-loop manner. The developed spatial common sense can then be distilled to a student SGG and substantially enhance its spatial reasoning ability. This Grounding Consistency Distillation (GCD) approach is model-agnostic and benefits from the superfluous unlabeled samples to retain the valuable context information and avert memorization of annotations. Furthermore, we demonstrate that current metrics disregard unlabeled samples, rendering them incapable of reflecting context bias; we then mine hard negatives and incorporate them during evaluation to reformulate precision as a reliable metric. Extensive experimental comparisons exhibit large quantitative - up to a 70% relative precision boost on the VG200 dataset - and qualitative improvements, proving the significance of our GCD method and our metrics towards refocusing graph generation as a core aspect of scene understanding. Code available at https://github.com/deeplab-ai/grounding-consistent-vrd.
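To make the closed-loop grounding idea concrete, the following is a minimal PyTorch sketch of a grounding-consistency-weighted distillation loss. It is not the authors' released implementation (see the linked repository for that); `GroundingNet`, the box parameterization, and the IoU-based consistency weighting are illustrative assumptions about how a teacher's predicted triplets could be re-grounded to the image and used to weight the distillation of predicate distributions into a student SGG on unlabeled object pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import box_iou


class GroundingNet(nn.Module):
    """Regresses subject/object boxes back from a predicted (subject, predicate, object) triplet."""

    def __init__(self, feat_dim: int = 256, num_predicates: int = 50):
        super().__init__()
        self.pred_embed = nn.Embedding(num_predicates, feat_dim)
        self.regressor = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 8),  # normalized (cx, cy, w, h) for subject and object
        )

    def forward(self, subj_feat, obj_feat, predicate_ids):
        rel = torch.cat([subj_feat, self.pred_embed(predicate_ids), obj_feat], dim=-1)
        cxcywh = torch.sigmoid(self.regressor(rel)).view(-1, 2, 4)
        cx, cy, w, h = cxcywh.unbind(-1)
        # Convert to (x1, y1, x2, y2) so the predicted boxes are always well-formed.
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def grounding_consistency(grounded_boxes, detected_boxes):
    """Per-triplet consistency in [0, 1]: IoU between re-grounded and detected boxes,
    averaged over the subject and object roles.
    Both inputs are (N, 2, 4) tensors in the same normalized (x1, y1, x2, y2) format."""
    ious = [
        box_iou(grounded_boxes[:, role], detected_boxes[:, role]).diagonal()
        for role in range(2)
    ]
    return torch.stack(ious, dim=-1).mean(dim=-1)


def gcd_loss(student_logits, teacher_logits, consistency, temperature=2.0):
    """Distill the teacher's predicate distributions into the student on unlabeled
    object pairs, down-weighting triplets that do not ground back consistently."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_logp, teacher_probs, reduction="none").sum(dim=-1)
    return (consistency * kl).mean() * temperature ** 2
```

The design choice mirrored here is that the grounding signal acts as a soft weight on the distillation term, so an unlabeled subject-object pair contributes to the student's training in proportion to how well its predicted predicate can be localized back to the image, rather than in proportion to how frequent its context is in the annotations.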