Academic Paper

Attention-Based Speech Recognition Using Gaze Information
Document Type
Conference
Source
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 465-470, Dec. 2019
Subject
Signal Processing and Analysis
Speech recognition
Acoustics
Task analysis
Decoding
Hidden Markov models
Lips
Feature extraction
end-to-end speech recognition
attention
multi-modal
gaze-point
Language
English
Abstract
We assume that there is a correlation between an utterance and the object being gazed at, and propose a new paradigm of multi-modal end-to-end speech recognition that uses two sources of information: utterances and the corresponding gaze points. In our method, the system extracts acoustic features together with images of the regions around the gaze points, and feeds them into the proposed attention-based multiple encoder-decoder network. This makes it possible to integrate the two modalities and improves speech recognition performance. To evaluate the proposed method, we prepared a simulated power-line control operation task and built a corpus containing the utterances and the corresponding gaze points recorded during the operations. An experimental evaluation on this corpus showed a reduction in the character error rate (CER), suggesting the effectiveness of the proposed method, in which acoustic features and gaze information are integrated.
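To make the described architecture concrete, the following is a minimal PyTorch sketch of one plausible reading of an attention-based multiple encoder-decoder: one encoder over acoustic frames, one over features of images cropped around gaze points, and a shared decoder that attends over each encoder separately at every step, concatenating the two context vectors. All layer sizes, the LSTM encoders, the CNN-feature input for gaze images, and fusion by concatenation are assumptions for illustration, not the paper's exact model.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Bahdanau-style additive attention over one encoder's output sequence.
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_out):
        # dec_state: (B, dec_dim); enc_out: (B, T, enc_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_out)
                                   + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)       # (B, T, 1)
        return (weights * enc_out).sum(dim=1)        # context: (B, enc_dim)

class MultiModalASR(nn.Module):
    # Hypothetical sizes: 80-dim filterbank frames, 512-dim CNN features of
    # images cropped around each gaze point, character vocabulary of 60.
    def __init__(self, n_acoustic=80, n_gaze=512,
                 enc_dim=256, dec_dim=256, vocab=60):
        super().__init__()
        self.dec_dim = dec_dim
        self.speech_enc = nn.LSTM(n_acoustic, enc_dim, batch_first=True)
        self.gaze_enc = nn.LSTM(n_gaze, enc_dim, batch_first=True)
        self.attn_speech = AdditiveAttention(enc_dim, dec_dim, 128)
        self.attn_gaze = AdditiveAttention(enc_dim, dec_dim, 128)
        self.embed = nn.Embedding(vocab, dec_dim)
        # The decoder consumes the previous token plus one context per modality.
        self.dec_cell = nn.LSTMCell(dec_dim + 2 * enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab)

    def forward(self, speech, gaze_feats, targets):
        # speech: (B, T_s, n_acoustic); gaze_feats: (B, T_g, n_gaze);
        # targets: (B, T_out) token ids, used here for teacher forcing.
        enc_s, _ = self.speech_enc(speech)
        enc_g, _ = self.gaze_enc(gaze_feats)
        h = speech.new_zeros(speech.size(0), self.dec_dim)
        c = torch.zeros_like(h)
        logits = []
        for t in range(targets.size(1)):
            ctx_s = self.attn_speech(h, enc_s)   # attend over the speech encoder
            ctx_g = self.attn_gaze(h, enc_g)     # attend over the gaze encoder
            step_in = torch.cat([self.embed(targets[:, t]), ctx_s, ctx_g], dim=-1)
            h, c = self.dec_cell(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)        # (B, T_out, vocab)

Concatenating the two attention contexts is only one possible fusion strategy; a hierarchical attention over the modalities themselves would be another, and the abstract does not specify which the authors use.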