Academic Journal Article

Image-Text Connection: Exploring the Expansion of the Diversity Within Joint Feature Space Similarity Scores
Document Type
Periodical
Source
IEEE Access, 11:123209-123222, 2023
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Adaptation models
Transformers
Joining processes
Computational modeling
Visualization
Task analysis
Representation learning
Information retrieval
Text mining
CLIP
cosine similarity matrix
diversity
dual-modal
image classification
image/text retrieval
joint embedding space
Language
English
ISSN
2169-3536
Abstract
Cross-modal representation learning aims to learn a shared representation space in which data from multiple modalities can be effectively compared, fused, and understood. This paper investigates how increased diversity in the similarity score matrix enhances the performance of CLIP (Contrastive Language-Image Pretraining), a multi-modal learning model that connects images and text within a joint embedding space. Two transformation approaches, sine and sigmoid (including two versions), are incorporated into the CLIP model to amplify larger values and diminish smaller values within the similarity matrix (the logits). Hardware limitations are addressed by using a more compact text encoder (DistilBERT) and a pre-trained ResNet50 image encoder. The proposed adaptations are evaluated on image classification and image/text retrieval tasks across 10 benchmark datasets, including Food101, Flickr30k, and COCO, and the performance of the adapted models is compared to the base CLIP model using Accuracy, mean per-class accuracy, and Recall@k metrics. The results demonstrate improvements over the CLIP baseline in Accuracy (up to 5.32% for the PatchCamelyon dataset), mean per-class accuracy (up to 14.48% for the FGVCAircraft dataset), and retrieval precision (up to a 45.20% increase in Recall@1 for the COCO dataset).
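
The abstract does not give the exact sine and sigmoid formulations, so the following is only a minimal sketch (PyTorch) of how an element-wise transform might be applied to a CLIP-style cosine similarity matrix before the symmetric contrastive loss. The centered sigmoid rescaling and the alpha steepness parameter are illustrative assumptions for this sketch, not the paper's method.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07, alpha=5.0):
    """CLIP-style symmetric contrastive loss with an element-wise transform
    applied to the cosine similarity matrix (the logits).

    image_emb, text_emb: (batch, dim) embeddings from the image and text encoders.
    temperature: fixed softmax temperature (CLIP learns this; fixed here for brevity).
    alpha: hypothetical steepness parameter of the illustrative sigmoid transform.
    """
    # Cosine similarity between every image and every caption in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()                 # values in [-1, 1]

    # Illustrative monotone squashing: positive similarities are pushed up and
    # negative ones pushed down, widening the gap between matching and
    # non-matching pairs (a stand-in for the paper's sine/sigmoid transforms).
    sims = torch.sigmoid(alpha * sims) * 2.0 - 1.0  # still in (-1, 1)

    logits = sims / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over rows (image-to-text) and columns (text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

In CLIP itself the temperature is a learned parameter; it is fixed above only to keep the sketch self-contained.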