Academic Paper

On Speaker Attribution with SURT

Document Type

Working Paper

Author

Raj, Desh; Wiesner, Matthew; Maciejewski, Matthew; Garcia-Perera, Leibny Paola; Povey, Daniel; Khudanpur, Sanjeev

Subject

Electrical Engineering and Systems Science - Audio and Speech Processing
Computer Science - Sound

Abstract

The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, SURT has been shown to be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework further by proposing methods to perform speaker-attributed transcription with SURT, for both short mixtures and long recordings. We achieve this by adding an auxiliary speaker branch to SURT and synchronizing its label prediction with ASR token prediction through HAT-style blank factorization. To keep relative speaker labels consistent across different utterance groups in a recording, we propose "speaker prefixing": prepending to each chunk high-confidence frames of speakers identified in previous chunks, to establish the relative order. We perform extensive ablation experiments on synthetic LibriSpeech mixtures to validate our design choices, and demonstrate the efficacy of our final model on the AMI corpus.
Comment: 8 pages, 6 figures, 6 tables. Submitted to Odyssey 2024
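
Illustrative note: the HAT-style blank factorization mentioned in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes that the blank probability comes from a sigmoid over a dedicated blank logit, that non-blank labels share the remaining probability mass through a softmax, and that the speaker branch reuses the ASR branch's blank probability so both branches emit labels on the same steps. All function names and logit values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hat_factorize(blank_logit, label_logits):
    """Split probability mass into blank and non-blank labels.

    Returns (p_blank, label_probs), where label_probs sums to 1 - p_blank.
    """
    p_blank = sigmoid(blank_logit)
    label_probs = [(1.0 - p_blank) * p for p in softmax(label_logits)]
    return p_blank, label_probs


def joint_asr_speaker(blank_logit, token_logits, speaker_logits):
    """Assumption: the speaker branch shares the ASR blank probability,
    so a speaker label is emitted only when an ASR token is emitted."""
    p_blank, token_probs = hat_factorize(blank_logit, token_logits)
    _, speaker_probs = hat_factorize(blank_logit, speaker_logits)
    return p_blank, token_probs, speaker_probs


# Hypothetical logits for 3 ASR tokens and 2 relative speaker labels.
p_blank, token_probs, speaker_probs = joint_asr_speaker(
    0.0, [1.0, 2.0, 0.5], [0.3, -0.3]
)
```

Because both branches are scaled by the same non-blank mass, each distribution sums to one together with the shared blank probability.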

Online Access

Open Access (arXiv)
