Academic paper

A 28nm 11.2TOPS/W Hardware-Utilization-Aware Neural-Network Accelerator with Dynamic Dataflow
Document Type
Conference
Source
2023 IEEE International Solid-State Circuits Conference (ISSCC), pp. 1-3, Feb. 2023
Subject
Bioengineering
Components, Circuits, Devices and Systems
Computing and Processing
Deep learning
Convolution
Shape
Neural networks
Parallel processing
Benchmark testing
Energy efficiency
Language
English
ISSN
2376-8606
Abstract
With the rapid evolution of AI technology, various neural-network structures have been developed for diverse applications. As a typical case, Fig. 22.4.1 shows that the convolution (Conv) layers used in convolutional neural networks (CNNs) feature distinct shapes and types. Neural-network accelerators with high peak energy efficiency have been demonstrated [1–4]. However, they usually suffer decreased hardware utilization (mainly of the multiply-accumulate (MAC) units) across varying network structures, which reduces the attainable energy efficiency accordingly. To improve MAC utilization, the Nvidia deep learning accelerator (NVDLA) [5] applies hardware parallelism along the channel direction, but utilization remains low for shallow layers: in our experiments, NVDLA achieves only 23% MAC utilization in the worst case. A Scatter-Gather scheme [4] mitigates the utilization drop for shallow layers by rearranging the input features (IF), but the improvement is limited. As depthwise convolution (Dwcv) is now widely used, its accompanying low MAC utilization must also be considered; taking MobileNetV2 as an example, NVDLA achieves only 0.4% utilization for Dwcv. To address these critical issues, this work presents a utilization-aware neural-network accelerator that dynamically changes the level of parallelism along multiple dimensions to maximize MAC utilization. The chip achieves >97.3% MAC utilization on benchmark networks while delivering 4.7× higher attainable energy efficiency than state-of-the-art designs [1–4].
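The utilization argument in the abstract can be sketched numerically. The following minimal Python model is illustrative only: the 64-lane array size and the "fold spatial work into idle channel lanes" mapping are assumptions for the sketch, not the chip's actual dataflow. It shows why channel-only parallelism (NVDLA-style) collapses for shallow and depthwise layers, while a mapping that reallocates idle lanes to another dimension restores utilization:

```python
from math import ceil

def util_fixed_channel(c_in, lanes=64):
    """MAC utilization when lanes are assigned only along input
    channels: each pass occupies min(c_in, lanes) of the lanes."""
    passes = ceil(c_in / lanes)
    return c_in / (passes * lanes)

def util_adaptive(c_in, spatial, lanes=64):
    """MAC utilization when idle channel lanes can be reassigned to
    spatial (pixel) parallelism, i.e. a dynamic multi-dimension mapping
    (assumed here for illustration)."""
    work = c_in * spatial  # independent MACs available per step
    passes = ceil(work / lanes)
    return work / (passes * lanes)

# Shallow RGB layer: only 3 input channels occupy a 64-lane array.
print(util_fixed_channel(3))          # 3/64 ≈ 4.7% utilization
print(util_adaptive(3, spatial=64))   # 192 MACs fill 3 full passes -> 1.0

# Depthwise conv: one input channel contributes per output channel.
print(util_fixed_channel(1))          # 1/64 ≈ 1.6% utilization
print(util_adaptive(1, spatial=64))   # 1.0
```

The gap between the two mappings in this toy model mirrors the abstract's motivation: fixed channel parallelism leaves most MACs idle exactly for the layer shapes (shallow, depthwise) where modern networks spend much of their compute.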