Academic Paper

AV-TAD: Audio-Visual Temporal Action Detection With Transformer

Document Type

Conference

Author

Li, Yangcheng; Yu, Zefang; Xiang, Suncheng; Liu, Ting; Fu, Yuzhuo

Source

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun 2023

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Visualization
Signal processing
Transformers
Acoustics
Decoding
Task analysis
Speech processing
Temporal Action Detection
Multi-modal
Transformer

Language

ISSN

2379-190X

Abstract

As an important and challenging task in video understanding, Temporal Action Detection (TAD) has been studied extensively in recent years. However, current works mainly tackle this task with visual information alone and neglect the potential of the audio modality. To address this gap, we propose a simple yet effective Audio-Visual Temporal Action Detection Transformer named AV-TAD, which performs early fusion of the audio and visual modalities in an end-to-end fashion. On top of this, we introduce a novel query formulation that directly adopts temporal segment coordinates as queries in the Transformer decoder, allowing the segments to be updated dynamically layer by layer. To the best of our knowledge, this is the first attempt to investigate both audio and video features with a multi-modal Transformer in the TAD task. Extensive experiments on the THUMOS14 dataset demonstrate that our proposed AV-TAD can outperform previous methods by a clear margin.
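The layer-by-layer segment update mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the (center, width) parameterization normalized to [0, 1], the additive update in inverse-sigmoid space, and all function names (`refine_segment`, `decode`, `to_seconds`) are assumptions made for illustration, and the per-layer offsets are hard-coded here where a real model would predict them from decoder features.

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Map a value in (0, 1) to an unbounded logit, clamping for stability."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refine_segment(segment, delta):
    """Apply one update step to a (center, width) segment query.

    Assumption: the segment is normalized to [0, 1] and the offset is added
    in logit space, so the refined segment stays inside (0, 1).
    """
    center, width = segment
    d_center, d_width = delta
    return (
        sigmoid(inverse_sigmoid(center) + d_center),
        sigmoid(inverse_sigmoid(width) + d_width),
    )


def decode(initial_segment, per_layer_deltas):
    """Refine a segment once per decoder layer and keep every intermediate."""
    history = [initial_segment]
    segment = initial_segment
    for delta in per_layer_deltas:  # one (hypothetical) offset per layer
        segment = refine_segment(segment, delta)
        history.append(segment)
    return history


def to_seconds(segment, video_duration):
    """Convert a normalized (center, width) segment to (start, end) seconds."""
    center, width = segment
    start = max(0.0, (center - width / 2.0) * video_duration)
    end = min(video_duration, (center + width / 2.0) * video_duration)
    return start, end


if __name__ == "__main__":
    # Hard-coded offsets standing in for the outputs of three decoder layers.
    deltas = [(0.4, -0.2), (0.1, -0.1), (0.0, 0.05)]
    for layer, seg in enumerate(decode((0.5, 0.3), deltas)):
        print(layer, to_seconds(seg, video_duration=120.0))
```

Because each layer starts from the segment produced by the previous one, the query carries an explicit temporal position through the decoder rather than an opaque learned embedding, which is the property the abstract attributes to the coordinate-based query formulation.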
