Academic Paper

Exploring Fusion of the Face and Voice Modalities Using CNN Features for a Better Performance
Document Type
Conference
Source
2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG). 1:1-6 Apr, 2023
Subject
Computing and Processing
Engineering Profession
General Topics for Engineers
Power, Energy and Industry Applications
Video on demand
Databases
Face recognition
Pipelines
Speech recognition
Feature extraction
Speaker recognition
convolutional neural network
face recognition
fusion
speaker recognition
Language
Abstract
Face recognition and speaker recognition have recently gained attention at large scale. Large-scale face databases such as VGG-Face and VGGFace2 were created with a semi-automated pipeline and have been used to develop face recognition methods that achieve state-of-the-art performance. The same is true of speaker recognition, with large-scale databases such as VoxCeleb and VoxCeleb2. However, these two modalities have been treated individually. Although some works have explored fusing both modalities, they have done so only at small scale. This work aims to create a large-scale database of faces and corresponding voices from YouTube under unconstrained conditions, comparable in size to the databases mentioned above, and to explore the fusion of the face and voice modalities for recognition at large scale. To this end, a database of faces and corresponding voices of 2,656 Nigerians available on YouTube was created using a semi-automated curation pipeline; it contains 2,055,169 face images and 195 hours of voice recordings. Convolutional Neural Networks (CNNs) were used to perform face recognition and speaker recognition individually. A CNN was then used on a combination of both modalities, achieving an Equal Error Rate (EER) more than 5 times lower than the best result in the individual cases.
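The abstract reports results as an Equal Error Rate but does not say how it was computed. As a minimal sketch of the usual definition (the operating point where the false acceptance rate equals the false rejection rate), the following assumes similarity scores where higher means "same identity". The function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) for scores where higher means 'same identity'.

    Illustrative only: the paper's actual evaluation protocol is not
    described in the abstract.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection rate: genuine pairs scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: impostor pairs scored at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # Read the EER off the threshold where the two error curves are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Toy scores (hypothetical): one genuine pair scores low, one impostor pair scores high.
eer, threshold = equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
print(f"EER = {eer:.2f} at threshold {threshold:.2f}")
```

Under this definition, an EER "more than 5 times lower" means the fused system's error at that balanced operating point is less than one fifth of the best single-modality error.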