Academic Paper

Audio-Visual Wake Word Spotting System for MISP Challenge 2021

Document Type

Conference

Author

Xu, Yanguang; Sun, Jianwei; Han, Yang; Zhao, Shuaijiang; Mei, Chaoyang; Guo, Tingwei; Zhou, Shuran; Xie, Chuandong; Zou, Wei; Li, Xiangang

Source

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 9246-9250, May 2022

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Visualization
Casting
Array signal processing
Conferences
Signal processing algorithms
Speech enhancement
Transformers
Audio-Visual
Wake Word Spotting
Attention
Far-Field

Language

ISSN

2379-190X

Abstract

This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. A multi-layer CNN is then used to learn audio and visual representations, and these representations are fed into a two-branch attention-based network, such as a transformer or conformer, which performs the fusion. Focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are combined by voting to achieve our final score of 0.091.
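
The abstract names two generic techniques, focal loss and model voting, without giving their settings. The following is a minimal Python sketch of both, not the authors' implementation. It assumes a binary wake-word decision; the `gamma` and `alpha` defaults are values commonly used with focal loss rather than values reported in this record, and the tie-breaking rule in `majority_vote` is a hypothetical choice.

```python
import math
from collections import Counter


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample.

    p: predicted probability that the wake word is present.
    y: ground-truth label, 1 (wake word) or 0 (no wake word).
    gamma, alpha: assumed defaults, not taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def majority_vote(decisions):
    """Combine per-model 0/1 decisions; ties go to 0 (an assumption)."""
    counts = Counter(decisions)
    return 1 if counts[1] > counts[0] else 0
```

With `gamma=0` and `alpha=0.5` the loss reduces to half the ordinary cross-entropy, which is a convenient sanity check; larger `gamma` puts relatively more weight on hard examples.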
