Academic Paper

FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning
Document Type
Periodical
Source
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM Trans. Audio Speech Lang. Process.). 29:2986-2999, 2021
Subject
Signal Processing and Analysis
Computing and Processing
Communication, Networking and Broadcast Technologies
General Topics for Engineers
Speech processing
Deep learning
Training
Benchmark testing
Tools
Speaker recognition
Residual neural networks
Attention
disfluency
deep learning
BLSTM
speech
stutter
squeeze-and-excitation
Language
ISSN
2329-9290
2329-9304
Abstract
Millions of people are affected by stuttering and other speech disfluencies, with the majority of the world having experienced mild stutters while communicating under stressful conditions. While there has been much research in the field of automatic speech recognition and language models, stutter detection and recognition have not received as much attention. To this end, we propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different stutter types. FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network that facilitates the learning of strong spectral frame-level representations, followed by a set of bidirectional long short-term memory layers that aid in learning effective temporal relationships. Lastly, FluentNet uses an attention mechanism to focus on the important parts of speech to obtain better performance. We perform a number of different experiments, comparisons, and ablation studies to evaluate our model. Our model achieves state-of-the-art results by outperforming other solutions in the field on the publicly available UCLASS dataset. Additionally, we present LibriStutter: a stuttered speech dataset based on the public LibriSpeech dataset with synthesized stutters. We also evaluate FluentNet on this dataset, showing the strong performance of our model versus a number of baseline and state-of-the-art techniques.
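
Illustrative sketch (not part of the catalog record or the authors' code): the abstract names two building blocks, squeeze-and-excitation channel reweighting and attention over time frames, which can be illustrated in plain Python. The function names `squeeze_excite` and `attention_pool`, the weight shapes, and the dot-product scoring are assumptions made for this example; the paper's actual layer sizes and attention formulation are not given in the abstract, and the bidirectional LSTM stage is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vec, weights, bias):
    # Fully connected layer; `weights` holds one row per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite(feature_maps, w1, b1, w2, b2):
    """Reweight channels of a feature map (hypothetical helper).

    feature_maps: list of C channels, each a list of rows (e.g. freq x time).
    w1, b1: bottleneck layer (C -> C/r); w2, b2: expansion layer (C/r -> C).
    """
    # Squeeze: global average of every channel gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / sum(len(row) for row in ch)
                for ch in feature_maps]
    # Excitation: bottleneck + ReLU, then expansion + sigmoid gate in (0, 1).
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[gate * x for x in row] for row in ch]
              for ch, gate in zip(feature_maps, gates)]
    return scaled, gates


def attention_pool(frames, query):
    """Softmax-weighted sum over time frames (assumed dot-product scoring).

    frames: list of T feature vectors; query: assumed learned scoring vector.
    """
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights
```

In a full model of the kind the abstract describes, the gated feature maps would feed the recurrent layers, and the pooled vector would feed a classifier over stutter types; both connections are left out of this sketch.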