Academic Paper

Audio Retrieval With Natural Language Queries: A Benchmark Study

Document Type

Periodical

Author

Koepke, A.S.; Oncescu, A.; Henriques, J.F.; Akata, Z.; Albanie, S.

Source

IEEE Transactions on Multimedia (IEEE Trans. Multimedia), 25:2675-2685, 2023

Subject

Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Task analysis
Benchmark testing
Natural languages
Visualization
Metadata
Grounding
Visual databases
Audio retrieval
text-based retrieval
datasets

Language

ISSN

1520-9210
1941-0077

Abstract

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries.
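
The retrieval task described in the abstract can be illustrated with a minimal sketch. It assumes that text queries and audio clips have already been mapped into a shared embedding space, ranks the candidates by cosine similarity, and scores the ranking with recall at K. The embedding values, the function names (cosine, rank_candidates, recall_at_k), and the choice of cosine similarity and recall at K are assumptions made for illustration; this record does not state the models or metrics the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, candidates, ground_truth, k):
    """Fraction of queries whose correct candidate appears among the top k results."""
    hits = 0
    for query, target in zip(queries, ground_truth):
        if target in rank_candidates(query, candidates)[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical toy embeddings: three text descriptions and three audio clips.
# ground_truth[i] is the index of the audio clip that matches text query i.
text_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.6, 0.0, 0.8]]
audio_emb = [[0.8, 0.0, 0.6], [0.1, 0.9, 0.0], [0.9, 0.1, 0.3]]
ground_truth = [0, 1, 2]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"text-audio R@{k}:", recall_at_k(text_emb, audio_emb, ground_truth, k))
```

Audio-text retrieval follows by swapping the roles of the two lists, for example recall_at_k(audio_emb, text_emb, ground_truth, k), so that each audio clip is the query and the text descriptions are the candidate pool.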
