Academic Paper

Too Large; Data Reduction for Vision-Language Pre-Training
Document Type
Conference
Source
2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3124-3134, Oct. 2023
Subject
Computing and Processing
Signal Processing and Analysis
Training
Computer vision
Image coding
Computational modeling
Redundancy
Noise measurement
Task analysis
Language
ISSN
2380-7504
Abstract
This paper examines the problems of severe image-text misalignment and high redundancy in widely used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward vision-language learning algorithm called TL;DR, which aims to compress existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original caption of each selected sample, mitigating the text-image misalignment problem while maintaining uniqueness. As a result, TL;DR enables us to reduce a large dataset to a small set of high-quality data, which can serve as an alternative pre-training dataset. This algorithm significantly speeds up the time-consuming pre-training process. Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., it reduces the well-cleaned CC3M dataset from 2.82M to 0.67M samples (~24%) and the noisy YFCC15M dataset from 15M to 2.5M samples (~16.7%). Extensive experiments with three popular VLP models over seven downstream tasks show that a VLP model trained on the compressed dataset provided by TL;DR can achieve similar or even better results than one trained on the full-scale dataset.
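
The compression ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only illustrative: the dataset sizes are the rounded figures from the abstract (in millions of image-text pairs), and the helper name `retained_fraction` is a hypothetical label introduced here, not something taken from the paper.

```python
# Dataset sizes in millions of image-text pairs, as quoted in the abstract.
# Values are the abstract's rounded figures, so results are approximate.
datasets = {
    "CC3M": (2.82, 0.67),
    "YFCC15M": (15.0, 2.5),
}


def retained_fraction(full_size: float, reduced_size: float) -> float:
    """Fraction of the original dataset kept after reduction (hypothetical helper)."""
    return reduced_size / full_size


for name, (full, reduced) in datasets.items():
    kept = retained_fraction(full, reduced)
    # Report both the retained share and the corresponding reduction factor.
    print(f"{name}: keeps {kept:.1%} of the data, "
          f"about {full / reduced:.1f}x fewer samples")
```

With the rounded sizes above, the retained share of CC3M comes out slightly under the abstract's "~24%", and the YFCC15M share matches "~16.7%" (one sixth of the original data).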