Academic Journal Article

An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data–Index Decoupling Workflow
Document Type
Periodical
Source
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IEEE Trans. VLSI Syst.), 31(10):1537-1550, Oct. 2023
Subject
Components, Circuits, Devices and Systems
Computing and Processing
Filtering algorithms
Convolutional neural networks
Convolution
Matched filters
Kernel
Inference algorithms
Heuristic algorithms
Convolutional neural network (CNN)
digital signal processor (DSP) efficiency
input channel scheduling
sparse awareness
Language
ISSN
1063-8210
1557-9999
Abstract
Convolutional neural networks (CNNs) continue to evolve to handle complex scenes and strict accuracy requirements. These evolutions change filter size, convolution type, and sparsity, and the resulting diversity makes it difficult to deploy evolving CNNs on field-programmable gate array (FPGA)-based accelerators. This article proposes a dense-/sparse-aware CNN accelerator that achieves high processing element (PE) utilization and configurability. First, a filter-based decomposition and clustering algorithm (FDCA) is proposed to convert filters of various sizes into filters of a unified size. In addition, a sparse-aware filter transformation scheme (SFTS) is presented to dynamically eliminate invalid weights for sparse filters and accelerate dense filters. With the sparsity dependency eliminated, a hardware accelerator with a data–index decoupling workflow and an input channel schedule-distribution system is designed to take advantage of FDCA and SFTS. The proposed accelerator is implemented on a Xilinx ZCU102 platform at 300 MHz. Across different CNN configurations, the digital signal processor (DSP) efficiencies for dense AlexNet, unstructured sparse AlexNet, dense MobileNetV2, and structured sparse MobileNetV2 are 0.987, 2.025, 0.547, and 1.278 GOPS/DSP, respectively. Compared with previous dense- and sparse-based designs, the accelerator achieves up to a $4.263\times$ improvement in DSP efficiency.
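The abstract does not spell out how FDCA or SFTS work internally, so the following is only a minimal sketch of the general idea of turning a larger filter into unified-size sub-filters and skipping zero weights. It assumes a 3x3 unified tile size, zero-padding of the filter up to a multiple of that size, and a simple "skip zero weights and all-zero tiles" rule; the function names (`conv2d_valid`, `decompose`, `conv2d_decomposed`) are hypothetical and are not taken from the paper.

```python
def conv2d_valid(image, kernel):
    """Reference 'valid' 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(ow)
        ]
        for r in range(oh)
    ]


def decompose(kernel, tile=3):
    """Zero-pad the kernel to a multiple of `tile` and split it into
    tile x tile sub-kernels, each tagged with its (row, col) offset.
    The tile size of 3 is an assumption made for illustration."""
    kh, kw = len(kernel), len(kernel[0])
    ph = -(-kh // tile) * tile  # ceil to a multiple of tile
    pw = -(-kw // tile) * tile
    padded = [
        [kernel[i][j] if i < kh and j < kw else 0 for j in range(pw)]
        for i in range(ph)
    ]
    tiles = []
    for r0 in range(0, ph, tile):
        for c0 in range(0, pw, tile):
            sub = [row[c0:c0 + tile] for row in padded[r0:r0 + tile]]
            tiles.append((r0, c0, sub))
    return tiles


def conv2d_decomposed(image, kernel, tile=3):
    """Same result as conv2d_valid, computed as a sum of unified-size
    sub-convolutions. Zero weights and all-zero tiles are skipped."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r0, c0, sub in decompose(kernel, tile):
        if not any(any(row) for row in sub):
            continue  # whole tile is zero: no work to schedule
        for r in range(oh):
            for c in range(ow):
                acc = 0
                for i in range(tile):
                    for j in range(tile):
                        w = sub[i][j]
                        if w != 0:  # padded/pruned weights never touch the input
                            acc += image[r + r0 + i][c + c0 + j] * w
                out[r][c] += acc
    return out
```

Under these assumptions, a 5x5 filter is padded to 6x6 and yields four 3x3 sub-filters, and `conv2d_decomposed(image, kernel)` matches `conv2d_valid(image, kernel)` for integer inputs. The point of the sketch is only that one fixed-size compute unit can serve filters of different sizes and sparsity patterns; the paper's actual clustering, weight transformation, and data–index decoupling are not reproduced here.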