Academic Journal Article

Memory-Based Augmentation Network for Video Captioning
Document Type
Periodical
Source
IEEE Transactions on Multimedia, vol. 26, pp. 2367-2379, 2024
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Visualization
Decoding
Task analysis
Semantics
Transformers
Context modeling
Linguistics
Attention mechanism
image captioning
LSTM
memory network
video captioning
Language
English
ISSN
1520-9210 (Print)
1941-0077 (Electronic)
Abstract
Video captioning focuses on generating natural language descriptions of video content. Existing works mainly explore this multimodal learning task with paired source videos and corresponding sentences, and have achieved competitive performance. Nonetheless, learning from video-description pairs alone cannot capture implicit external knowledge, i.e., the multiple visual contexts and linguistic clues present across the video-language dataset, which may limit the model's cognitive capability to generate diverse descriptions. To this end, we propose a Memory-based Augmentation Network (MAN), in which a memory structure augments the current encoder-decoder framework by incorporating implicit external knowledge through a neural memory. Specifically, we first propose a visual memory for the encoder that stores multiple visual contexts across videos in the dataset and is used to obtain memory-augmented contextual features for the source video. In addition, a textual memory is introduced for the decoder to capture external language clues across sentences in the dataset; it produces memory-augmented language features at each time step. Compared with the basic encoder-decoder framework, the proposed approach captures a more comprehensive contextual understanding, which is more compatible with the human cognitive process. Extensive experiments on three video captioning datasets, MSVD, MSR-VTT, and VATEX, demonstrate the effectiveness of the proposed method.
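The abstract describes memory augmentation as attention-style retrieval from a memory bank shared across the dataset. The sketch below is a minimal illustration of that idea, not the authors' released code: it assumes a learnable memory bank queried by scaled dot-product attention, with the retrieved context fused back into the input feature. All names (MemoryAugmentation, num_slots, fuse) are hypothetical.

```python
# Hypothetical sketch of a neural-memory augmentation module, in the spirit
# of MAN's visual/textual memories (assumed design, not the paper's code).
import math
import torch
import torch.nn as nn

class MemoryAugmentation(nn.Module):
    def __init__(self, feat_dim: int, num_slots: int = 128):
        super().__init__()
        # Learnable memory bank: one row per slot, shared across all samples,
        # so it can accumulate dataset-level context during training.
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) encoder features or decoder states.
        q = self.query_proj(x)
        # Scaled dot-product attention over memory slots: (batch, seq, slots).
        attn = torch.softmax(q @ self.memory.t() / math.sqrt(q.size(-1)), dim=-1)
        retrieved = attn @ self.memory            # (batch, seq, feat_dim)
        # Fuse retrieved external context with the original feature.
        return self.fuse(torch.cat([x, retrieved], dim=-1))

# Usage: augment per-frame video features before decoding.
feats = torch.randn(2, 20, 512)                   # (batch, frames, dim)
aug = MemoryAugmentation(feat_dim=512)
out = aug(feats)                                  # (2, 20, 512), memory-augmented
```

The same module shape would apply on the decoder side, queried with the hidden state at each time step to retrieve linguistic context rather than visual context.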