학술논문

Speaker Turn Aware Similarity Scoring for Diarization of Speech-Based Cognitive Assessments
Document Type
Conference
Source
2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2021 Asia-Pacific. :1299-1304 Dec, 2021
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Convolution
Training data
Information processing
Acoustic measurements
Data models
Acoustics
Microphones
Language
ISSN
2640-0103
Abstract
This paper proposes two enhancements to the con-ventional speaker diarization methods for speech-based Montreal cognitive assessments (MoCA). The enhancements address the technical challenges of MoCA recordings on two fronts. First, multi-scale channel interdependence speaker embedding is used as the front-end speaker representation for overcoming the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent at-tention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Evaluations on an interactive dialog dataset for MoCA show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone mismatch scenarios. The results also show that the speaker-turn timestamps can be hypothesized, suggesting that the proposed enhancements are amendable to datasets without speaker timestamp information.