Academic Paper

Improving Multimodal Movie Scene Segmentation Using Mixture of Acoustic Experts
Document Type
Conference
Source
2022 30th European Signal Processing Conference (EUSIPCO), 6-10 Aug 2022
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Robotics and Control Systems
Signal Processing and Analysis
Visualization
Computational modeling
Semantics
Multimedia computing
Machine learning
Signal processing
Motion pictures
Movie
Scene Segmentation
Mixture of Experts
Multimodal Attention
Audio
Language
ISSN
2076-1465
Abstract
Scenes are the most basic semantic units of a movie and serve as an important pre-processing step for various multimedia computing technologies. Previous scene segmentation studies introduced constraints and alignment mechanisms to cluster low-level frames and shots based on visual features and temporal properties. Recent work has extended these approaches with multimodal semantic representations, using acoustic representations blindly extracted by a universal pretrained model. Such approaches tend to ignore the semantic meaning of audio and the complex interaction between audio and visual representations for scene segmentation. In this work, we introduce a mixture-of-audio-experts (MOAE) framework that integrates acoustic experts and multimodal experts for scene segmentation. Each acoustic expert is trained to model a different acoustic semantic, including speakers, environmental sounds, and other events. The MOAE delicately optimizes the weights among the various multimodal experts and achieves a state-of-the-art 61.89% F1-score for scene segmentation. We visualize the expert weights in our framework to illustrate the complementary properties among the diverse experts, which lead to improved segmentation results.
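The abstract describes weighting the outputs of several experts via a learned gate. The paper's actual architecture is not given here, so the following is only a minimal sketch of the generic mixture-of-experts fusion idea it alludes to: per-expert scene-boundary scores are combined with softmax gate weights. All names, scores, and logits below are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_of_experts(expert_scores, gate_logits):
    """Fuse per-expert boundary scores with softmax gate weights.

    expert_scores: array of shape (n_experts,), each expert's
                   scene-boundary score for one shot transition.
    gate_logits:   array of shape (n_experts,), unnormalized gate
                   outputs (in the paper these would be learned).
    Returns the fused score and the normalized expert weights.
    """
    weights = softmax(gate_logits)
    fused = float(np.dot(weights, expert_scores))
    return fused, weights

# Hypothetical scores from three acoustic experts
# (e.g. speaker, environmental sound, event experts):
scores = np.array([0.9, 0.2, 0.6])
gate_logits = np.array([2.0, 0.1, 1.0])
fused, weights = mixture_of_experts(scores, gate_logits)
```

Because the gate weights form a convex combination, the fused score always lies between the lowest and highest expert scores; visualizing `weights` per transition is one way to inspect the complementary roles of the experts, as the abstract describes.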