Academic Paper

TFF-Codec: A High Fidelity End-to-End Neural Audio Codec
Document Type
Original Paper
Source
Circuits, Systems, and Signal Processing. :1-20
Subject
Audio codec
End to end neural network
Auto encoder
High fidelity audio generation
Language
English
ISSN
0278-081X
1531-5878
Abstract
Audio coding has made significant progress with the development of deep neural networks. Recently, neural speech codecs based on the vector-quantized variational autoencoder have become increasingly popular among researchers due to their elegant design and superior performance, but their application to high-bitrate audio coding remains underexplored. In this paper, we propose a novel high-fidelity end-to-end neural audio codec called time frequency fusion codec (TFF-Codec), which is capable of high-quality reconstruction of 32 kHz audio in the time–frequency domain at 48 and 64 kbps. We also propose a dual-path time–frequency filtering module to capture the local structure of the spectrogram and the long-term temporal dependence between consecutive frames. The proposed codec consists of an encoder, the time–frequency filtering module, a vector quantizer, and a decoder. First, the input audio is fed into the encoder to obtain its latent representation. Then, this representation is modeled in the frequency domain by the time–frequency filtering module. Subsequently, it is further compressed by the vector quantizer. Finally, the decoder produces the reconstructed audio. We also use a combination of multiple loss functions in TFF-Codec to ensure that the reconstructed audio is balanced in terms of objective metrics and subjective listening quality. To evaluate the performance of TFF-Codec, comparative experiments are conducted against the traditional audio codec Opus and several recent neural audio codecs. Both subjective and objective evaluations demonstrate the superiority of the proposed method.
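
The vector quantization step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no codebook size, latent dimension, frame rate, or number of quantizers, so every value below is a hypothetical toy choice, and the function names `quantize` and `bitrate_bps` are invented for illustration. The sketch shows nearest-neighbor codebook lookup and how, under these assumptions, the bitrate follows from the number of indices transmitted per second.

```python
import math

import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each latent frame to its nearest codebook entry (Euclidean distance).

    latents:  array of shape (num_frames, dim)
    codebook: array of shape (codebook_size, dim)
    Returns (indices, quantized) with shapes (num_frames,) and (num_frames, dim).
    """
    # Squared distance between every frame and every codebook entry.
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


def bitrate_bps(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second when each frame is coded by `num_codebooks` indices."""
    return frames_per_second * num_codebooks * math.log2(codebook_size)


# Hypothetical toy values, not taken from the paper.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))
# Latent frames that sit very close to three known codebook entries.
latents = codebook[[3, 17, 512]] + 0.001 * rng.normal(size=(3, 8))

indices, quantized = quantize(latents, codebook)
toy_rate = bitrate_bps(frames_per_second=50, num_codebooks=8, codebook_size=1024)
print(indices, toy_rate)
```

Only the integer indices need to be transmitted; the decoder looks up the same codebook entries to recover the quantized latents. Trained codecs learn the codebook jointly with the encoder and decoder, which this sketch omits.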