Academic Paper

Cross-Domain Image Captioning with Discriminative Finetuning
Document Type
Conference
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6935-6944, Jun. 2023
Subject
Computing and Processing
Training
Computer vision
Codes
Computational modeling
Semantics
Closed box
Reinforcement learning
Vision, language, and reasoning
Language
ISSN
2575-7075
Abstract
Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify that image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators performing an image discrimination task.
Our code is available at https://github.com/facebookresearch/EGG/tree/main/egg/zoo/discriminative_captioner.
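
The discriminative objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' EGG-based implementation: a frozen CLIP model merely stands in for the text-conditioned image retriever, the captioner.sample interface is a hypothetical placeholder for whatever sampling API the captioner exposes, and the mean-baseline reward shaping is an assumption. The idea is a REINFORCE-style update that rewards sampled captions when the retriever picks the target image out of the batch.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen retriever standing in for the paper's out-of-the-box
# text-conditioned image retriever.
retriever = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in retriever.parameters():
    p.requires_grad_(False)

def discriminative_loss(captioner, images):
    """Policy-gradient surrogate loss for one batch.

    `images` is a list of PIL images; `captioner.sample(images)` is an
    assumed interface returning sampled captions and the summed token
    log-probabilities of those samples (adapt to your captioner's API).
    """
    captions, caption_logprobs = captioner.sample(images)  # hypothetical API
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = retriever(**inputs)
        # logits_per_text: rows are captions, columns are batch images.
        sims = out.logits_per_text
        # Reward: log-probability the retriever assigns to the matching
        # (diagonal) image, centered by a simple mean baseline.
        reward = F.log_softmax(sims, dim=-1).diagonal()
        reward = reward - reward.mean()
    # Push up the log-probability of captions that earned a high reward.
    return -(reward * caption_logprobs).mean()

In a training loop, this loss would simply replace (or be mixed with) the usual cross-entropy term while the retriever stays frozen, which is the sense in which the objective is self-supervised: no extra human annotation is needed beyond the images themselves.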