Academic Paper

A Hybrid DFSMN and Mamba Architecture for Low Bitrate Neural Speech Coding
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1-5, Nov. 2024
Subject
Computing and Processing
Signal Processing and Analysis
Speech codecs
Training
Time-frequency analysis
Speech coding
Bit rate
Vectors
Real-time systems
Decoding
Speech processing
Spectrogram
neural speech coding
Mamba
DFSMN
Language
English
Abstract
In this paper, we propose a novel low-bitrate neural speech codec based on sequence modeling networks. The proposed method consists of a convolution-based encoder and decoder, a DFSMN-Mamba module, and a vector quantizer. The DFSMN-Mamba module combines the Deep Feedforward Sequential Memory Network (DFSMN) with the selective state space model Mamba, and models the input features in parallel along both the time and frequency dimensions. An adversarial loss is used to train the entire codec framework, enabling compression of speech waveforms into compact discrete representations at low bitrates. Experimental results show that the proposed method outperforms the baseline in both subjective and objective evaluations.
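The vector quantizer mentioned in the abstract is the stage that turns continuous encoder outputs into the compact discrete representations transmitted at low bitrates. A minimal sketch of that idea, assuming a simple nearest-neighbor codebook lookup in NumPy (the codebook size, frame dimension, and function name here are illustrative, not taken from the paper):

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    frames:   (num_frames, dim) continuous encoder outputs
    codebook: (num_entries, dim) learned codebook (random here for illustration)
    Returns (indices, quantized_frames).
    """
    # Squared Euclidean distance from every frame to every codebook entry,
    # computed via broadcasting: result has shape (num_frames, num_entries).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # nearest entry per frame
    return idx, codebook[idx]       # indices are what gets transmitted

# Hypothetical sizes: a 256-entry codebook means each frame costs 8 bits.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))
frames = rng.normal(size=(10, 8))
idx, quantized = vector_quantize(frames, codebook)
```

In a trained codec the codebook is learned jointly with the encoder and decoder rather than sampled randomly, and the transmitted bitrate is set by the number of codebook entries and the frame rate.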