Academic Paper
Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance
Document Type
Conference
Source
2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG), 1:1-6, Apr. 2023
Abstract
Face recognition and speaker recognition have both attracted attention at large scale in recent years. Large-scale face databases such as VGG-Face and VGGFace2 were created using semi-automated pipelines and have been used to develop face recognition methods that achieve state-of-the-art performance. The same holds for speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, the two modalities have mostly been treated separately. Although some works have explored fusing them, they have done so only at small scale. This work aims to create a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, comparable in size to the databases mentioned above, and to explore the fusion of the face and voice modalities for recognition at large scale. To this end, a database of Nigerians available on YouTube was created using a semi-automated curation pipeline; it covers 2,656 Nigerians and contains 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then used on the combination of both modalities, achieving an Equal Error Rate (EER) more than 5 times lower than the best result obtained with either modality alone.
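The abstract reports its headline result as an Equal Error Rate, so a brief illustration of that metric may help. The EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The sketch below is one common way to estimate it from verification scores. It is not taken from the paper: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER from similarity scores (higher = more likely same person).

    Hypothetical helper for illustration; the paper does not describe
    how its EER values were computed.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    # A trial is accepted when its score is >= the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection rate

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores, made up for illustration only (not data from the paper).
genuine_scores = [0.2, 0.6, 0.8, 0.9]
impostor_scores = [0.1, 0.3, 0.4, 0.7]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2f} at threshold {threshold:.2f}")
```

Under this definition, a fused system whose EER is "more than 5 times lower" than a single-modality system is one whose EER is less than one fifth of the single-modality value when both are evaluated on the same trials.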