Academic Paper

Can Large Language Models Understand Spatial Audio?

Document Type

Working Paper

Author

Tang, Changli; Yu, Wenyi; Sun, Guangzhi; Chen, Xianzhao; Tan, Tian; Li, Wei; Zhang, Jun; Lu, Lu; Ma, Zejun; Wang, Yuxuan; Zhang, Chao

Subject

Computer Science - Sound
Electrical Engineering and Systems Science - Audio and Speech Processing

Abstract

This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill that current auditory LLMs lack. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study three spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and localisation-informed speech extraction (LSE), achieving notable progress in each. For SSL, our approach achieves a mean absolute error (MAE) of $2.70^{\circ}$ on the Spatial LibriSpeech dataset, substantially better than the prior benchmark of about $6.60^{\circ}$. Moreover, our model can use spatial cues to improve FSR accuracy and can perform LSE by selectively attending to sounds originating from a direction specified in a text prompt, even amid overlapping speech. These findings highlight the potential of adapting LLMs to grasp physical audio concepts, paving the way for LLM-based agents in 3D environments.
Comment: Accepted at Interspeech 2024
