Academic Paper

When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach

Document Type

Conference

Author

Tao, M.; Bai, Bing; Lin, Haozhe; Wang, Heyuan; Wang, Yu; Luo, Lin; Fang, Lu

Source

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22119-22128, Jun. 2024

Subject

Computing and Processing
Visualization
Computer vision
Grounding
Computational modeling
Natural languages
Imaging
Benchmark testing

Language

ISSN

2575-7075

Abstract

Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable, we introduce a novel dataset, named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scales, and “multi-hop expressions”. Furthermore, we introduce a simple yet effective grounding approach, which employs a “glance-to-zoom-in” paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.
