Academic Paper

L-Verse: Bidirectional Generation Between Image and Text

Document Type

Conference

Author

Kim, Taehoon; Song, Gwangmo; Lee, Sihaeng; Kim, Sangyun; Seo, Yewon; Lee, Soonyoung; Kim, Seung Hwan; Lee, Honglak; Bae, Kyunghoon

Source

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16505-16515, Jun 2022

Subject

Computing and Processing
Representation learning
Training
Scalability
Computer architecture
Transformers
Robustness
Pattern recognition
Vision + language; Image and video synthesis and generation; Representation learning; Scene analysis and understanding

Language

ISSN

2575-7075

Abstract

Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse can be directly used for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of L-Verse architecture on Conceptual Captions and present the initial result of bidirectional vision-language representation learning on general domain.

