Journal Article

Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification
Document Type
Periodical
Author
Source
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1000-1012, 2023
Subject
Signal Processing and Analysis
Computing and Processing
Communication, Networking and Broadcast Technologies
General Topics for Engineers
Bayes methods
Neural networks
Task analysis
Training
Deep learning
Computational modeling
Additives
Speaker verification
deep neural network
self-attention
speaker embedding
x-vectors
Language
English
ISSN
2329-9290 (print)
2329-9304 (electronic)
Abstract
Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads has also been investigated as a way to capture different aspects of a speaker's input speech. One challenge for multi-head attention is information redundancy: if no constraint is imposed during training, different heads may extract similar attentive features. In this paper, we generalize deterministic multi-head attention to a Bayesian attention framework and provide a new understanding of multi-head attention from a Bayesian perspective. Under this framework, we adopt a recently developed sampling method from the optimization literature that explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets, and significant, consistent improvements over other multi-head attention systems are achieved on all of them. The best Bayesian system, with eight heads, improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
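To make the pooling mechanism described in the abstract concrete, here is a minimal sketch of multi-head self-attentive pooling in PyTorch. This is an illustration assembled from the abstract alone, not the paper's implementation: the tanh scoring network, the layer dimensions, and the names `SelfAttentivePooling`, `proj`, and `score` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Weighted average of frame-level hidden vectors, one weight set per head.

    Hypothetical sketch: the tanh scoring network and the dimensions are
    illustrative assumptions, not the paper's exact configuration.
    """

    def __init__(self, hidden_dim: int, num_heads: int, attn_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)    # shared frame projection
        self.score = nn.Linear(attn_dim, num_heads)    # one attention score per head

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, hidden_dim) frame-level hidden vectors
        scores = self.score(torch.tanh(self.proj(h)))  # (batch, frames, heads)
        alpha = F.softmax(scores, dim=1)               # weights sum to 1 over frames
        # per-head weighted average over frames, then concatenate the heads
        pooled = torch.einsum('bfh,bfd->bhd', alpha, h)
        return pooled.flatten(1)                       # (batch, heads * hidden_dim)

# Example: 8 heads over 512-dim hidden vectors of a 200-frame utterance.
pool = SelfAttentivePooling(hidden_dim=512, num_heads=8)
embedding = pool(torch.randn(4, 200, 512))             # shape: (4, 4096)
```

The repulsive sampling step can be sketched in the same spirit. A well-known particle-based sampler whose update carries an explicit repulsive term is Stein variational gradient descent (SVGD); treating the heads as particles drawn from a posterior over attention parameters gives the update below. Whether the paper uses exactly this update is an assumption here, as are the RBF kernel and the step size.

```python
def repulsive_head_update(heads: torch.Tensor, score_grad: torch.Tensor,
                          lengthscale: float = 1.0, step: float = 1e-2) -> torch.Tensor:
    """One SVGD-style update over head parameters of shape (n_heads, dim).

    score_grad[i] holds the gradient of the log-posterior at heads[i]. The
    kernel-gradient (repulsion) term pushes similar heads apart, which is
    what discourages redundant attentive features.
    """
    diff = heads.unsqueeze(1) - heads.unsqueeze(0)     # (n, n, d): x_i - x_j
    kernel = torch.exp(-(diff ** 2).sum(-1) / (2 * lengthscale ** 2))  # RBF, (n, n)
    attraction = kernel @ score_grad                   # kernel-weighted score gradients
    repulsion = (diff * (kernel / lengthscale ** 2).unsqueeze(-1)).sum(1)
    return heads + step * (attraction + repulsion) / heads.shape[0]
```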