Academic Journal Article

Memory-Based Augmentation Network for Video Captioning
Document Type
Periodical
Source
IEEE Transactions on Multimedia, vol. 26, pp. 2367-2379, 2024
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Visualization
Decoding
Task analysis
Semantics
Transformers
Context modeling
Linguistics
Attention mechanism
image captioning
LSTM
memory network
video captioning
Language
English
ISSN
1520-9210 (Print)
1941-0077 (Electronic)
Abstract
Video captioning focuses on generating natural language descriptions of video content. Existing works mainly explore this multimodal learning task with paired source videos and corresponding sentences, and have achieved competitive performance. Nonetheless, learning from video-description pairs alone cannot capture implicit external knowledge, i.e., the multiple visual contexts and linguistic clues present across the video-language dataset, which may limit the model's cognitive capability to generate diverse descriptions. To this end, we propose a Memory-based Augmentation Network (MAN), in which a memory structure augments the current encoder-decoder framework by incorporating implicit external knowledge through a neural memory. Specifically, we first propose a visual memory for the encoder that stores multiple visual contexts across videos in the dataset and is used to obtain memory-augmented contextual features for the source video. In addition, a textual memory is introduced for the decoder to capture external language clues across sentences in the dataset; it produces memory-augmented language features at each time step. Compared with the basic encoder-decoder framework, the proposed approach captures a more comprehensive contextual understanding, which is more compatible with the human cognitive process. Extensive experiments on three video captioning datasets, MSVD, MSR-VTT, and VATEX, demonstrate the effectiveness of the proposed method.
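The abstract describes memory augmentation as attention-style retrieval from a memory bank shared across the dataset. The sketch below is a minimal illustration of that idea, not the authors' released code: it assumes a learnable memory bank queried by scaled dot-product attention, with the retrieved context fused back into the input feature. All names (MemoryAugmentation, num_slots, fuse) are hypothetical.

```python
# Hypothetical sketch of a neural-memory augmentation module, in the spirit
# of MAN's visual/textual memories (assumed design, not the paper's code).
import math
import torch
import torch.nn as nn

class MemoryAugmentation(nn.Module):
    def __init__(self, feat_dim: int, num_slots: int = 128):
        super().__init__()
        # Learnable memory bank: one row per slot, shared across all samples,
        # so it can accumulate dataset-level context during training.
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) encoder features or decoder states.
        q = self.query_proj(x)
        # Scaled dot-product attention over memory slots: (batch, seq, slots).
        attn = torch.softmax(q @ self.memory.t() / math.sqrt(q.size(-1)), dim=-1)
        retrieved = attn @ self.memory            # (batch, seq, feat_dim)
        # Fuse retrieved external context with the original feature.
        return self.fuse(torch.cat([x, retrieved], dim=-1))

# Usage: augment per-frame video features before decoding.
feats = torch.randn(2, 20, 512)                   # (batch, frames, dim)
aug = MemoryAugmentation(feat_dim=512)
out = aug(feats)                                  # (2, 20, 512), memory-augmented
```

The same module shape would apply on the decoder side, queried with the hidden state at each time step to retrieve linguistic context rather than visual context.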