Academic Article

Holistic Multi-Modal Memory Network for Movie Question Answering
Document Type
Periodical
Source
IEEE Transactions on Image Processing, 29:489-499, 2020
Subject
Signal Processing and Analysis
Communication, Networking and Broadcast Technologies
Computing and Processing
Knowledge discovery
Visualization
Videos
Hidden Markov models
Task analysis
Motion pictures
Semantics
Question answering
multi-modal learning
MovieQA
Language
English
ISSN
1057-7149 (print)
1941-0042 (online)
Abstract
Answering questions using multi-modal context is a challenging problem, as it requires a deep integration of diverse data sources. Existing approaches consider only a subset of all possible interactions among data sources during one attention hop. In this paper, we present a holistic multi-modal memory network (HMMN) framework that fully considers interactions between different input sources (multi-modal context and question) at each hop. In addition, to home in on relevant information, our framework takes answer choices into account during the context retrieval stage. Our HMMN framework effectively integrates information from the multi-modal context, question, and answer choices, enabling more informative context to be retrieved for question answering. Experimental results on the MovieQA and TVQA datasets validate the effectiveness of our HMMN framework. Extensive ablation studies show the importance of holistic reasoning and reveal the contributions of different attention strategies to model performance.
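The abstract describes the architecture only at a high level. The following is a minimal NumPy sketch of what one such "holistic" attention hop could look like: the query jointly encodes the question and an answer choice, attends over video and subtitle memories, and fuses the retrieved summaries. The shapes, the additive fusion, and the scoring rule are illustrative assumptions, not the paper's actual equations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def holistic_hop(video_ctx, subtitle_ctx, question, answer):
    """One illustrative attention hop: the query mixes the question with an
    answer choice, then attends over both context modalities and fuses the
    retrieved summaries. (Hypothetical simplification, not the authors'
    exact formulation.)"""
    query = question + answer                # joint question/answer query
    attn_v = softmax(video_ctx @ query)      # attention weights over video features
    attn_s = softmax(subtitle_ctx @ query)   # attention weights over subtitle features
    v_summary = attn_v @ video_ctx           # weighted video summary
    s_summary = attn_s @ subtitle_ctx        # weighted subtitle summary
    return query + v_summary + s_summary     # updated query for the next hop

# Toy example: score each of five answer choices for one question.
d = 64
rng = np.random.default_rng(0)
video_ctx = rng.standard_normal((20, d))     # 20 video-clip features
subtitle_ctx = rng.standard_normal((30, d))  # 30 subtitle features
question = rng.standard_normal(d)
answers = rng.standard_normal((5, d))

scores = []
for a in answers:
    state = holistic_hop(video_ctx, subtitle_ctx, question, a)
    state = holistic_hop(video_ctx, subtitle_ctx, question + state, a)  # second hop
    scores.append(float(state @ a))          # match retrieved context to the answer
print("predicted answer:", int(np.argmax(scores)))
```

Because each hop's query already contains the answer choice, the context retrieval is conditioned on all three input sources at once, which is the "holistic" interaction the abstract contrasts with approaches that attend with the question alone.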