Academic Article

Recurrent unit augmented memory network for video summarisation
Document Type
article
Source
IET Computer Vision, Vol 17, Iss 6, Pp 710-721 (2023)
Subject
computer vision
neural nets
Computer applications to medicine. Medical informatics
R858-859.7
Computer software
QA76.75-76.765
Language
English
ISSN
1751-9640 (online)
1751-9632 (print)
Abstract
Video summarisation can relieve the pressure on video storage, transmission, archiving, and retrieval caused by the explosive growth of online video in recent years. Most existing supervised video summarisation methods use a convolutional neural network (CNN) or a recurrent neural network (RNN) to model the temporal dependencies between video frames or video shots. CNNs mainly capture local information, while RNNs lose long‐term information when the input sequence is long; both therefore have limited ability to capture long‐range memory in a video. A recurrent unit augmented memory network (RUAMN) for video summarisation is therefore proposed, which effectively exploits the long‐term memory extraction ability of the end‐to‐end memory network (MemN2N) while addressing MemN2N's insensitivity to temporal sequence information. The proposed RUAMN also strengthens the memory update between multiple computational steps (hops) and ultimately produces a meaningful video summary. Specifically, RUAMN consists of an input module, a global‐and‐local sampling stage, a memory module, and an output module. The input module uses a bidirectional GRU to obtain the forward and backward context of each video frame. The global‐and‐local sampling stage then applies global and local sampling to the output sequence of the input module to obtain several shorter sequences, so that the memory module can capture fine‐grained relationships between video frames more effectively. The memory module extracts long‐term memory information from the feature sequence, and the output module finally predicts frame‐level importance scores. Extensive experiments on the benchmark datasets TVSum and SumMe demonstrate the superiority of the method over several state‐of‐the‐art supervised video summarisation approaches.
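
Note: The abstract describes the architecture only at a high level. The pipeline it outlines (a bidirectional GRU encoder, global-and-local sampling, a MemN2N-style multi-hop memory module, and a frame-level scoring head) might be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: all module names, dimensions, the sampling scheme, and the use of a GRU cell for the between-hop memory update are illustrative guesses, not the authors' implementation.

# Minimal illustrative sketch of the pipeline described in the abstract:
# bidirectional GRU encoder -> global/local sampling -> MemN2N-style memory
# hops -> frame-level importance scores. All names, sizes, and the sampling
# scheme are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class MemoryHop(nn.Module):
    """One MemN2N-style hop: attend over memory, then update the query state."""
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # A recurrent unit augmenting the between-hop memory update (assumption).
        self.update = nn.GRUCell(dim, dim)

    def forward(self, query, memory):
        # query: (B, D); memory: (B, T, D)
        attn = torch.softmax(torch.einsum('bd,btd->bt', query, self.key(memory)), dim=-1)
        read = torch.einsum('bt,btd->bd', attn, self.value(memory))
        return self.update(read, query)

class RUAMNSketch(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, hops=3, local_window=8):
        super().__init__()
        # Input module: bidirectional GRU gives forward+backward context per frame.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        dim = 2 * hidden
        self.hops = nn.ModuleList(MemoryHop(dim) for _ in range(hops))
        self.score = nn.Linear(dim, 1)  # output module: frame-level importance
        self.local_window = local_window

    def forward(self, frames):
        # frames: (B, T, feat_dim) pre-extracted frame features
        enc, _ = self.encoder(frames)                # (B, T, 2*hidden)
        B, T, D = enc.shape
        w = self.local_window
        scores = []
        for t in range(T):
            # Local sampling: a short window around frame t.
            lo, hi = max(0, t - w), min(T, t + w + 1)
            local = enc[:, lo:hi]
            # Global sampling: a strided subsequence over the whole video.
            global_ = enc[:, ::max(1, T // (2 * w))]
            memory = torch.cat([local, global_], dim=1)
            q = enc[:, t]
            for hop in self.hops:                    # multi-hop memory reading
                q = hop(q, memory)
            scores.append(self.score(q))
        return torch.sigmoid(torch.cat(scores, dim=1))   # (B, T) importance

if __name__ == '__main__':
    model = RUAMNSketch()
    x = torch.randn(2, 120, 1024)    # 2 videos, 120 frames, 1024-d features
    print(model(x).shape)            # torch.Size([2, 120])

The GRUCell in MemoryHop stands in for the "recurrent unit augmented" memory update between hops that the abstract credits with making MemN2N sensitive to temporal order; the paper's actual update rule and sampling strategy may differ.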