Academic Paper

An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data–Index Decoupling Workflow

Document Type

Periodical

Author

Meng, Y.; Yang, C.; Xiang, S.; Wang, J.; Mei, K.; Geng, L.

Source

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IEEE Trans. VLSI Syst.), 31(10):1537-1550, Oct. 2023

Subject

Components, Circuits, Devices and Systems
Computing and Processing
Filtering algorithms
Convolutional neural networks
Convolution
Matched filters
Kernel
Inference algorithms
Heuristic algorithms
Convolutional neural network (CNN)
digital signal processor (DSP) efficiency
input channel scheduling
sparse awareness

Language

English

ISSN

1063-8210
1557-9999

Abstract

To adapt to complex scenes and strict accuracy requirements, convolutional neural networks (CNNs) have evolved continuously. However, this evolution has changed filter sizes, convolution types, and sparsity, and this diversity makes it difficult to deploy evolving CNNs on field-programmable gate array (FPGA)-based accelerators. This article proposes a dense-/sparse-aware CNN accelerator that achieves high processing element (PE) utilization and configurability. First, a filter-based decomposition and clustering algorithm (FDCA) is proposed to convert filters of various sizes into filters of a unified size. In addition, a sparse-aware filter transformation scheme (SFTS) is presented to dynamically eliminate invalid weights in sparse filters and to accelerate dense filters. With the dependence on sparsity thus eliminated, a hardware accelerator with a data–index decoupling workflow and an input channel schedule-distribution system is designed to exploit FDCA and SFTS. The proposed accelerator is implemented on a Xilinx ZCU102 platform at 300 MHz. Across different CNN configurations, the digital signal processor (DSP) efficiencies for dense AlexNet, unstructured sparse AlexNet, dense MobileNetV2, and structured sparse MobileNetV2 are 0.987, 2.025, 0.547, and 1.278 GOPS/DSP, respectively. Compared with previous dense and sparse designs, the accelerator achieves up to 4.263× higher DSP efficiency.
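The abstract does not spell out how FDCA or SFTS work internally. As a rough illustration of the two general ideas only, the Python sketch below (a) splits a K×K filter into zero-padded 3×3 sub-filters whose partial results sum to the full convolution, and (b) skips zero weights by storing each sub-filter's nonzero values and their positions in separate lists, loosely echoing a separation of data and indices. The tile size of 3, the helper names (`decompose`, `compress`, `decomposed_sparse_conv`), and the single-channel, stride-1 setting are hypothetical choices for this example. The clustering step of FDCA, the dense-filter acceleration of SFTS, and all hardware scheduling are not modeled, so this is not the authors' algorithm.

```python
from itertools import product

TILE = 3  # assumed unified sub-filter size (illustrative choice, not taken from the paper)


def conv2d_valid(x, w):
    """Reference 'valid' 2-D cross-correlation of input x with a KxK filter w."""
    H, W, K = len(x), len(x[0]), len(w)
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(K) for b in range(K))
             for j in range(W - K + 1)]
            for i in range(H - K + 1)]


def decompose(w, tile=TILE):
    """Hypothetical helper: split a KxK filter into ceil(K/tile)^2 zero-padded
    tile x tile sub-filters. Returns (row_offset, col_offset, sub_filter) triples."""
    K = len(w)
    n = -(-K // tile)  # ceil(K / tile)
    subs = []
    for ti, tj in product(range(n), repeat=2):
        r, c = ti * tile, tj * tile
        sub = [[w[r + a][c + b] if r + a < K and c + b < K else 0
                for b in range(tile)]
               for a in range(tile)]
        subs.append((r, c, sub))
    return subs


def compress(sub):
    """Hypothetical helper: drop zero weights, keeping nonzero values and their
    flat positions in two separate lists."""
    flat = [v for row in sub for v in row]
    values = [v for v in flat if v != 0]
    indices = [k for k, v in enumerate(flat) if v != 0]
    return values, indices


def decomposed_sparse_conv(x, w, tile=TILE):
    """Compute conv2d_valid(x, w) as a sum of tile x tile sub-convolutions,
    issuing multiplications only for nonzero weights. Returns (output, mac_count)."""
    H, W, K = len(x), len(x[0]), len(w)
    n = -(-K // tile)
    pad = n * tile - K  # zero rows/cols appended so padded sub-filters stay in bounds
    xp = [row + [0] * pad for row in x] + [[0] * (W + pad) for _ in range(pad)]
    out = [[0] * (W - K + 1) for _ in range(H - K + 1)]
    macs = 0
    for r, c, sub in decompose(w, tile):
        values, indices = compress(sub)
        for v, k in zip(values, indices):
            a, b = divmod(k, tile)  # recover the 2-D position from the flat index
            for i in range(H - K + 1):
                for j in range(W - K + 1):
                    out[i][j] += v * xp[i + r + a][j + c + b]
                    macs += 1
    return out, macs


if __name__ == "__main__":
    x = [[(3 * i + 5 * j) % 7 - 3 for j in range(9)] for i in range(9)]
    w5 = [[(i * j) % 4 - 1 if (i + j) % 3 else 0 for j in range(5)] for i in range(5)]
    out, macs = decomposed_sparse_conv(x, w5)
    assert out == conv2d_valid(x, w5)
    print("5x5 filter ->", len(decompose(w5)), "3x3 sub-filters,", macs, "MACs")
```

Because zero weights, including the padding that fills out the last row and column of sub-filters, are never issued, the MAC count in this sketch equals the number of nonzero weights times the number of output pixels. That is a software analogue of eliminating invalid weights, not a model of the accelerator's PE array.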

Online Access

Full Text (IEEE); Web of Science; JCR journal information; Scopus; Find it@PNU