Academic Article

Multimodality-guided Visual-Caption Semantic Enhancement
Document Type
Article
Source
Computer Vision and Image Understanding, Vol. 249, December 2024
ISSN
1077-3142
Abstract
Highlights
• We build a new dataset with multimodal triples for multi-modality perception.
• A fusion framework enhances caption semantics by combining visual and auditory data.
• Extensive experiments validate our framework and confirm its effectiveness.
• ChatGPT generates syntactic structures to demonstrate the framework's availability.