Academic Article

Video question answering via multi-granularity temporal attention network learning
Document Type
Conference
Source
Proceedings of the 10th International Conference on Internet Multimedia Computing and Service, pp. 1-5
Subject
temporal co-attention
video question answering
visual information retrieval
Language
English
Abstract
This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that generates an answer based on the video content and the question. Existing works mostly rely on overall frame-level visual understanding, which neglects the finer-grained and temporal information inside the video, or combine multi-grained representations simply by concatenation or addition. We therefore propose the multi-granularity temporal attention network (MGTA-Net), which can search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations between the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a double-layer LSTM to generate the answer. The effectiveness of our model is demonstrated on a large-scale dataset.
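The architecture described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of question-guided temporal attention over two visual granularities, fused by a two-layer LSTM; all class names, dimensions, and the one-directional simplification of the paper's mutual (co-)attention are assumptions for illustration, not the authors' actual MGTA-Net implementation.

    # Hypothetical sketch of the abstract's idea: attend to frames at two
    # granularities conditioned on the question, then fuse hierarchically
    # with a two-layer LSTM. Not the paper's actual implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QuestionGuidedAttention(nn.Module):
        """One-directional simplification of the paper's mutual attention:
        the question vector weights a sequence of visual features."""
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, visual, question):
            # visual: (B, T, D), question: (B, D)
            scores = torch.bmm(self.proj(visual), question.unsqueeze(2))  # (B, T, 1)
            weights = F.softmax(scores, dim=1)
            return (weights * visual).sum(dim=1)                          # (B, D)

    class MGTASketch(nn.Module):
        def __init__(self, dim=512, num_answers=1000):
            super().__init__()
            self.frame_att = QuestionGuidedAttention(dim)   # coarse, frame-level
            self.region_att = QuestionGuidedAttention(dim)  # finer-grained level
            # two stacked LSTM layers combine the attended features hierarchically
            self.fusion = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(dim, num_answers)

        def forward(self, frame_feats, region_feats, question_vec):
            coarse = self.frame_att(frame_feats, question_vec)   # (B, D)
            fine = self.region_att(region_feats, question_vec)   # (B, D)
            seq = torch.stack([fine, coarse], dim=1)             # (B, 2, D)
            out, _ = self.fusion(seq)
            return self.classifier(out[:, -1])                   # answer logits

    # Toy usage with random features standing in for extracted representations.
    model = MGTASketch()
    frames = torch.randn(2, 20, 512)    # 20 frame-level features per video
    regions = torch.randn(2, 80, 512)   # finer-grained features per video
    question = torch.randn(2, 512)      # encoded question vector
    logits = model(frames, regions, question)
    print(logits.shape)                 # torch.Size([2, 1000])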
