Academic Paper

LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-TDNN for Speaker Verification

Document Type

Working Paper

Author

Cao, Di; Wang, Xianchen; Zhou, Junfeng; Zhang, Jiakai; Lei, Yanjing; Chen, Wenpeng

Subject

Computer Science - Computation and Language
Computer Science - Sound
Electrical Engineering and Systems Science - Audio and Speech Processing

Abstract

Traditional Time Delay Neural Networks (TDNNs) have achieved state-of-the-art performance at the cost of high computational complexity and slow inference, which makes them difficult to deploy in industrial settings. The Densely Connected Time Delay Neural Network (D-TDNN) with a Context-Aware Masking (CAM) module has proven to be an efficient structure that reduces complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) to fuse features from different levels. Extensive experiments on the VoxCeleb dataset show that LightCAM achieves an EER of 0.83 and a MinDCF of 0.0891 on VoxCeleb1-O, outperforming other mainstream speaker verification methods. In addition, a complexity analysis demonstrates that the proposed architecture has lower computational cost and faster inference speed.
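
The abstract attributes part of the complexity reduction to depthwise separable convolution. As a rough illustration of why that factorization is cheaper, the sketch below compares the weight count of a standard 1-D convolution with that of a depthwise convolution followed by a pointwise (1x1) convolution. The function names and the layer sizes (128 channels, kernel size 3) are assumptions chosen for illustration only; they are not taken from the paper and do not describe the actual LightCAM configuration.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a standard 1-D convolution (bias omitted)."""
    return c_in * c_out * kernel_size


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Weight count of a depthwise conv followed by a pointwise conv (bias omitted)."""
    depthwise = c_in * kernel_size  # one length-k filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes for illustration only (not from the paper).
C_IN, C_OUT, K = 128, 128, 3

standard = conv1d_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)
ratio = separable / standard  # equals 1/C_OUT + 1/K

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.3f}")
```

Under these assumed sizes, the separable form needs roughly a third of the weights. In general the ratio is 1/C_out + 1/k, so the saving grows with the kernel size and the number of output channels.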

Online Access

Open Access (arXiv)