Academic Paper

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Document Type

Conference

Author

Huang, W. Ronny; Chang, Shuo-Yiin; Sainath, Tara N.; He, Yanzhang; Rybach, David; David, Robert; Prabhavalkar, Rohit; Allauzen, Cyril; Peyser, Cal; Strohman, Trevor D.

Source

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Video on demand
Earth Observing System
Signal processing algorithms
Speech recognition
Signal processing
Real-time systems
Decoding
ASR
segmentation
decoding algorithms

Language

ISSN

2379-190X

Abstract

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real time, synchronously with the decoder) to finalize the non-causal 2nd pass (which runs 900 ms behind real time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
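
The abstract does not spell out how dummy frame injection works, so the following is only a toy sketch of one plausible reading, not the authors' implementation. It assumes a 2nd pass that needs a fixed number of future frames (`lookahead`, a hypothetical parameter standing in for the 900 ms delay) before it can emit an output for a given frame, and it assumes the dummy frames are constant filler values. The class `LookaheadSecondPass` and the function `finalize_with_dummy_frames` are illustrative names, and the averaging "encoder" is a placeholder for a real non-causal encoder.

```python
class LookaheadSecondPass:
    """Toy non-causal pass: the output for frame t needs frames t..t+lookahead."""

    def __init__(self, lookahead):
        self.lookahead = lookahead  # assumed right-context size, in frames
        self.buffer = []            # all frames received so far
        self.emitted = 0            # number of frames already given an output

    def push(self, frame):
        """Add one frame and return any outputs that just became computable."""
        self.buffer.append(frame)
        outputs = []
        # A frame can be emitted once its full right context has arrived.
        while self.emitted + self.lookahead < len(self.buffer):
            window = self.buffer[self.emitted:self.emitted + self.lookahead + 1]
            outputs.append(sum(window) / len(window))  # placeholder computation
            self.emitted += 1
        return outputs


def finalize_with_dummy_frames(second_pass, dummy=0.0):
    """On an EOS signal, inject `lookahead` dummy frames to flush pending outputs.

    This avoids waiting for real future audio, at the cost of the last few
    outputs seeing filler values in their right context (an assumption here).
    """
    outputs = []
    for _ in range(second_pass.lookahead):
        outputs.extend(second_pass.push(dummy))
    return outputs


# Usage: stream five real frames, then finalize when the 1st pass emits EOS.
second_pass = LookaheadSecondPass(lookahead=3)
streamed = []
for frame in [1.0, 2.0, 3.0, 4.0, 5.0]:
    streamed.extend(second_pass.push(frame))
flushed = finalize_with_dummy_frames(second_pass)
```

In this sketch the streamed outputs lag the input by `lookahead` frames, and the flush produces exactly the outputs still pending at EOS, so every real frame receives a 2nd pass output without waiting for more audio.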
