Journal Article

MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model With Multi-Modal Transformer
Document Type
Periodical
Source
IEEE Transactions on Emerging Topics in Computational Intelligence, 8(2):1756-1771, Apr. 2024
Subject
Computing and Processing
Training
Predictive models
Feature extraction
Visualization
Task analysis
Object detection
Transformers
Weakly-supervised training strategy
audio-visual saliency prediction
cross-modal transformer
feature reuse mechanism
two-stage training methodology
Language
English
ISSN
2471-285X
Abstract
Although various video saliency models have achieved considerable performance gains, existing deep learning-based audio-visual saliency prediction models are still at an early exploration stage. The major challenge is that relatively few audio-visual sequences with real human eye fixations collected under audio-visual conditions are available. To this end, this paper presents a novel multi-modal transformer-based class activation mapping (MTCAM) model trained in a weakly-supervised manner to effectively alleviate the need for large-scale datasets in audio-visual saliency prediction. In particular, using only video category labels from the video classification task, we propose class activation mapping based on a multi-modal transformer, which follows a two-stage training methodology to extract the most discriminative regions. Such regions with strong discriminative ability are highly consistent with real human eye fixations. Meanwhile, we further devise an efficient feature reuse mechanism to reduce redundant computation and enable previously obtained features to provide effective guidance for downstream model learning. Notably, this work is the first attempt to exploit a cross-modal transformer to model cross-modal interaction across the entire video and predict human eye fixations under a weakly-supervised training strategy. We conduct extensive experiments on several benchmark datasets to demonstrate that the proposed MTCAM model significantly outperforms other competitors. Furthermore, detailed ablation experiments are performed to validate the effectiveness and rationality of each component of the proposed model.
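The abstract describes the core idea of weakly-supervised saliency via class activation mapping over fused audio-visual transformer features. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: all module names, dimensions, and the fusion/normalization details are hypothetical assumptions, and the actual MTCAM architecture, two-stage training, and feature reuse mechanism are not reproduced here.

```python
# Hypothetical sketch: class activation mapping on fused audio-visual tokens,
# trained only with video category labels (weak supervision). Not MTCAM itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAVClassifierCAM(nn.Module):
    def __init__(self, dim=256, num_classes=10, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(dim, num_classes)  # weights reused for CAM

    def forward(self, vis_tokens, aud_tokens, spatial_hw):
        # vis_tokens: (B, H*W, dim) visual patch features for one clip
        # aud_tokens: (B, Ta, dim) audio features
        h, w = spatial_hw
        fused = self.fusion(torch.cat([vis_tokens, aud_tokens], dim=1))
        vis_fused = fused[:, : h * w]                  # keep the visual tokens
        pooled = vis_fused.mean(dim=1)                 # global average pooling
        logits = self.classifier(pooled)               # weak-supervision target

        # Class activation map: project classifier weights back onto patches.
        cls = logits.argmax(dim=-1)                    # predicted category
        w_c = self.classifier.weight[cls]              # (B, dim)
        cam = torch.einsum('bnd,bd->bn', vis_fused, w_c).view(-1, 1, h, w)
        cam = F.relu(cam)
        lo = cam.amin(dim=(2, 3), keepdim=True)
        hi = cam.amax(dim=(2, 3), keepdim=True)
        cam = (cam - lo) / (hi - lo + 1e-6)            # min-max normalize
        return logits, cam                             # cam as a coarse saliency map

# Usage: a classification loss on `logits` drives training; the CAM is read
# out and upsampled as the saliency prediction.
model = ToyAVClassifierCAM()
vis = torch.randn(2, 14 * 14, 256)
aud = torch.randn(2, 8, 256)
logits, saliency = model(vis, aud, spatial_hw=(14, 14))
saliency = F.interpolate(saliency, size=(224, 224), mode='bilinear',
                         align_corners=False)
```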