Academic Article

Cycle-Consistent Generative Adversarial Network Architectures for Audio Visual Speech Recognition
Document Type
Conference
Source
2023 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1-6, Nov. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Visualization
Speech recognition
Computer architecture
Generative adversarial networks
Robustness
Task analysis
Generative Adversarial Networks (GANs)
deep learning
audio visual speech recognition
Language
English
ISSN
2837-116X
Abstract
Generative Adversarial Networks (GANs) have found extensive applications in image classification and image generation. Nevertheless, applying them to the recognition and detection of multimodal images remains difficult. Audio Visual Speech Recognition (AVSR) is a classic multimodal audio-visual sensing task that leverages audio inputs from human speech together with aligned visual inputs from lip movements. However, AVSR performance is affected by the inherent discrepancies of real-world environments, such as variations in lighting intensity, noise, and sampling devices. To mitigate these challenges, this paper proposes an AVSR architecture based on a specially constructed Cycle-Consistent Generative Adversarial Network (CycleGAN). First, on the visual side, we apply data-augmentation methods such as flipping and rotation to the video data, increasing the number and variety of samples and thereby improving the robustness and generalisation capability of the model. Then, because the AVSR data were collected in different environments with different styles, we transform the original images multiple times through the specially constructed CycleGAN module to address the inherent differences between environments. To validate the approach, we used augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets during training. Experimental results validate the correctness and effectiveness of the approach.
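The abstract describes two visual-side steps: flip/rotation augmentation of the lip-region frames, followed by a CycleGAN-based style transformation to reduce cross-environment differences. The Python sketch below illustrates that pipeline under stated assumptions; it is not the authors' code, the tensor shapes and the use of torchvision transforms are assumptions, and the CycleGAN generator (here replaced by torch.nn.Identity as a stand-in) would come from a separately trained model.

import torch
from torchvision import transforms

# Flip/rotation augmentation applied to each lip-region frame (tensor of shape (C, H, W)).
frame_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
])

def augment_clip(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, C, H, W) video clip; each frame is augmented independently.
    return torch.stack([frame_augment(f) for f in frames])

@torch.no_grad()
def restyle_clip(frames: torch.Tensor, generator: torch.nn.Module) -> torch.Tensor:
    # Map frames from a source recording style to a target style with a
    # CycleGAN-style generator G: X -> Y. The generator and its weights are
    # assumed to be trained separately; any image-to-image nn.Module fits here.
    generator.eval()
    return generator(frames)

if __name__ == "__main__":
    clip = torch.rand(16, 3, 112, 112)                  # dummy 16-frame lip-region clip
    clip = augment_clip(clip)                           # step 1: flip/rotation augmentation
    clip = restyle_clip(clip, torch.nn.Identity())      # step 2: style normalisation (Identity as placeholder)
    print(clip.shape)

A usage note: in this reading of the abstract, the augmented and style-normalised frames would then be fed, together with the aligned audio features, into the AVSR front-end during training on LRS2/LRS3.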