학술논문

Audio Retrieval With Natural Language Queries: A Benchmark Study
Document Type
Periodical
Source
IEEE Transactions on Multimedia IEEE Trans. Multimedia Multimedia, IEEE Transactions on. 25:2675-2685 2023
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Task analysis
Benchmark testing
Natural languages
Visualization
Metadata
Grounding
Visual databases
Audio retrieval
text-based retrieval
datasets
Language
ISSN
1520-9210
1941-0077
Abstract
The objectives of this work are cross-modal text-audio and audio-text retrieval , in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries.