Journal Article

EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset
Document Type
Periodical
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):6783-6793, Jun. 2023
Subject
Computing and Processing
Bioengineering
Task analysis
Artificial intelligence
Visualization
Synchronization
Natural languages
Computer vision
Education
Egocentric
multi-modal data
EgoCom
communication
turn-taking
human-centric
embodied intelligence
Language
English
ISSN
Print ISSN: 0162-8828
CD ISSN: 2160-9292
Electronic ISSN: 1939-3539
Abstract
Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural-conversation dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking to within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment the performance of embodied AI tasks.
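
The abstract's second application combines speech-to-text output captured simultaneously from several egocentric perspectives. Below is a minimal illustrative sketch of one way such merging could work: keeping, per time bin, the word hypothesis with the highest recognizer confidence across perspectives. The data structure, field names, bin size, and confidence-based selection are assumptions for illustration only, not the paper's actual method.

# Illustrative sketch (assumptions noted above, not the paper's method):
# merge synchronized word-level ASR hypotheses from multiple egocentric
# perspectives by keeping the highest-confidence word per time bin.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordHypothesis:
    word: str          # recognized word
    start: float       # start time in seconds on a shared clock
    confidence: float  # recognizer confidence in [0, 1]


def merge_perspectives(
    hypotheses_per_view: List[List[WordHypothesis]],
    bin_size: float = 0.5,
) -> List[str]:
    """Keep, for each time bin, the single word hypothesis with the
    highest confidence across all synchronized egocentric recordings."""
    best: Dict[int, WordHypothesis] = {}
    for view in hypotheses_per_view:
        for hyp in view:
            key = int(hyp.start // bin_size)
            if key not in best or hyp.confidence > best[key].confidence:
                best[key] = hyp
    return [best[k].word for k in sorted(best)]


if __name__ == "__main__":
    view_a = [WordHypothesis("hello", 0.1, 0.9), WordHypothesis("word", 0.7, 0.4)]
    view_b = [WordHypothesis("hallow", 0.1, 0.5), WordHypothesis("world", 0.7, 0.8)]
    print(merge_perspectives([view_a, view_b]))  # ['hello', 'world']

The sketch highlights why simultaneous multi-perspective capture helps: each wearer's recording tends to be most reliable for nearby speakers, so selecting across perspectives can recover words that any single perspective misrecognizes.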