Academic Paper

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Document Type

Working Paper

Author

Zhang, Haotian; You, Haoxuan; Dufter, Philipp; Zhang, Bowen; Chen, Chen; Chen, Hong-You; Fu, Tsu-Jui; Wang, William Yang; Chang, Shih-Fu; Gan, Zhe; Yang, Yinfei

Source

Subject

Computer Science - Computer Vision and Pattern Recognition

Language

Abstract

While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by its fixed pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
Comment: Preprint. 14 pages, 4 figures

Online Access

Open Access (arXiv)
