Academic Paper

Can Multimodal Pointer Generator Transformers Produce Topically Relevant Summaries?
Document Type
Conference
Source
2023 International Joint Conference on Neural Networks (IJCNN), pp. 1-8, Jun. 2023
Subject
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Measurement
Visualization
Image segmentation
Image coding
Neural networks
Parallel processing
Transformers
Multimodal Summarization
Transformer
Multi-task Learning
Language
ISSN
2161-4407
Abstract
Due to the growing demand for brief and pertinent multimedia content over the past few years, multimodal summarization has attracted considerable research interest. Recently, Transformers have been widely used for various sequence-processing tasks owing to their fast parallel processing compared to LSTMs. Although Multimodal Summarization (MS) has gained much research traction of late, a gap remains in producing topic-relevant multimodal summaries. Since any summary conveys concise information, it should carry the essence of the topic from which it was derived. Further, owing to the lack of alignment information among the images and the inter-modal segments, MS systems also struggle to choose appropriate pictorial summaries. To study these research questions, we propose a Multitask-learning-based Multimodal Pointer Generator Transformer (MPGT), which utilizes the topic information of the samples to produce multimodal summaries. We also augment the popular MSMO dataset for this study with similar “On-Topic” and “Off-Topic” images. Our results show that inter-modal attention among images helps achieve better alignment in the visual modality and improves image-precision scores. Our analysis also discusses how topic-relevant MS systems can be further enhanced.
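The abstract credits inter-modal attention with improving visual alignment, but this record does not specify MPGT's architecture. As a rough illustration only, the general mechanism such models rely on — text tokens attending over image-region features via scaled dot-product attention — can be sketched as follows; all names, shapes, and dimensions here are assumptions for the example, not details from the paper:

```python
import numpy as np

def cross_modal_attention(text_feats, image_feats):
    """Generic scaled dot-product attention: text tokens (queries)
    attend over image-region features (keys and values).

    text_feats:  (T, d) array of text-token embeddings
    image_feats: (R, d) array of image-region embeddings
    Returns the attended features (T, d) and the (T, R) alignment map.
    """
    d = text_feats.shape[-1]
    # Similarity of every text token to every image region, scaled by sqrt(d)
    scores = text_feats @ image_feats.T / np.sqrt(d)
    # Row-wise softmax: each text token distributes attention over regions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ image_feats, weights

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))   # 5 text tokens, hypothetical d=64
imgs = rng.standard_normal((3, 64))   # 3 image regions
attended, align = cross_modal_attention(text, imgs)
print(attended.shape, align.shape)    # (5, 64) (5, 3)
```

The alignment map `align` is the kind of soft text-to-image correspondence that, per the abstract, helps MS systems pick appropriate pictorial summaries.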