Journal Article

Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval
Document Type
Periodical
Source
IEEE Transactions on Neural Networks and Learning Systems, 35(5):7150-7161, May 2024
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
General Topics for Engineers
Semantics
Videos
Encoding
Natural languages
Learning systems
Shape
Fuses
Fine-grained fusion
Gaussian decay
query-adaptive
semantic representation
Language
English
ISSN
2162-237X
2162-2388
Abstract
Recently, hierarchical fine-grained fusion mechanisms have proven effective for cross-modal retrieval between videos and texts. In this setting, video-text semantic matching is decomposed into three levels, namely global-event representation matching, action-relation representation matching, and local-entity representation matching, and each level of semantic representation can serve a query reasonably well on its own. However, in real-world scenarios and applications, existing methods fail to adaptively estimate, before multilevel fusion, how effective each level of semantic representation will be for a given query, which results in worse performance than expected. It is therefore essential to identify the effectiveness of the hierarchical semantic representations in a query-adaptive manner. To this end, this article proposes an effective query-adaptive multilevel fusion (QAMF) model that operates on the multiple similarity scores between the hierarchical visual and text representations. First, we decompose the video-side and text-side representations into hierarchical semantic representations at the global-event, action-relation, and local-entity levels, respectively. Then, the multilevel representations of the video-text pair are aligned to compute a similarity score for each level. We observe that the sorted similarity score curves of effective semantic representations differ from those of inferior ones: they exhibit a "cliff" shape followed by a gradual decline (see Fig. 1 for an example). Finally, we use a Gaussian decay function to fit the tail of the score curve and compute the area under the normalized sorted similarity curve as an indicator of representation effectiveness: the smaller the area, the more effective the semantic representation. Extensive experiments on three public benchmark video-text datasets demonstrate that our method consistently outperforms the state-of-the-art (SoTA). A simple demo of QAMF will soon be publicly available on our homepage: https://github.com/Lab-ANT.
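
The weighting idea described in the abstract can be illustrated with a short Python sketch. It sorts and normalizes each level's similarity scores, fits a Gaussian decay to the tail of the sorted curve, and uses the area under the curve as the effectiveness indicator, with smaller areas receiving larger fusion weights. The function names, the tail split point, and the inverse-area weighting scheme are assumptions made for illustration only; this is not the authors' released implementation.

import numpy as np
from scipy.optimize import curve_fit

def gaussian_decay(x, a, sigma):
    # Gaussian decay curve used to fit the tail of a sorted similarity curve.
    return a * np.exp(-(x ** 2) / (2.0 * sigma ** 2))

def effectiveness_area(similarities, tail_fraction=0.5):
    # Area under the normalized, sorted similarity curve. A "cliff"-shaped
    # curve (one candidate clearly matches, the rest decay quickly) gives a
    # small area, which the abstract treats as a sign of an effective level.
    s = np.sort(np.asarray(similarities, dtype=float))[::-1]   # sort descending
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)             # normalize to [0, 1]
    x = np.linspace(0.0, 1.0, len(s))
    start = int(len(s) * (1.0 - tail_fraction))                # assumed tail split point
    (a, sigma), _ = curve_fit(gaussian_decay, x[start:], s[start:],
                              p0=(1.0, 0.3), maxfev=5000)
    fitted = np.concatenate([s[:start], gaussian_decay(x[start:], a, sigma)])
    return np.trapz(fitted, x)

def fuse_similarities(level_scores):
    # Weight each semantic level inversely to its area, then fuse the scores
    # (inverse-area weighting is an illustrative assumption).
    areas = np.array([effectiveness_area(s) for s in level_scores])
    weights = 1.0 / (areas + 1e-8)
    weights /= weights.sum()
    return sum(w * np.asarray(s) for w, s in zip(weights, level_scores))

# Toy usage: three levels (event, action, entity) each scoring 100 candidate videos.
rng = np.random.default_rng(0)
event, action, entity = (rng.random(100) for _ in range(3))
event[0] = 2.0  # the event level confidently ranks one candidate far above the rest
fused = fuse_similarities([event, action, entity])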