Academic Paper

ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes
Document Type
Conference
Source
2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3512-3522, Jan. 2024
Subject
Computing and Processing
Training
Learning systems
Solid modeling
Three-dimensional displays
Annotations
Grounding
Neural networks
Algorithms
3D computer vision
Datasets and evaluations
Language
ISSN
2642-9381
Abstract
The two popular datasets ScanRefer [20] and ReferIt3D [5] connect natural language to real-world 3D scenes. In this paper, we curate a complementary dataset that extends both of them: we associate every object mentioned in a referential sentence with its underlying instance inside a 3D scene, whereas previous work did this only for a single object per sentence. Our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k referential sentences, covering 705 real-world scenes. We propose novel architecture modifications and losses that enable learning from this new type of data and improve performance on both neural listening and language generation. For neural listening, we improve the SoTA on the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. For language generation, we improve the SoTA by 13.2 CIDEr points on the Nr3D benchmark. For both tasks, the new type of data is used only during training; no additional annotations are required at inference time. Our introduced dataset is available on the project’s webpage at https://scanents3d.github.io/.
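For illustration only, the sketch below shows one way a phrase-to-3D-object correspondence record of the kind described in the abstract could be represented. The field names (scene_id, target_instance_id, mentioned_objects, etc.) are assumptions made for this example and are not taken from the actual ScanEnts3D release.

```python
# Hypothetical correspondence record: every phrase mentioning an object in the
# referential sentence is linked to a 3D instance id, not just the target.
# Field names and values are illustrative assumptions, not the ScanEnts3D schema.
example_record = {
    "scene_id": "scene0000_00",
    "sentence": "the chair next to the desk, under the window",
    "target_instance_id": 12,  # the single referred-to object, as in ScanRefer/ReferIt3D
    "mentioned_objects": [     # additional links for all mentioned objects
        {"phrase": "the chair", "instance_id": 12},
        {"phrase": "the desk", "instance_id": 7},
        {"phrase": "the window", "instance_id": 3},
    ],
}

def phrase_instance_pairs(record):
    """Return (phrase, instance_id) pairs for every object mentioned in the sentence."""
    return [(m["phrase"], m["instance_id"]) for m in record["mentioned_objects"]]

print(phrase_instance_pairs(example_record))
```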