Academic Article

The Efficiency of Convolution on Gemmini Deep Learning Hardware Accelerator
Document Type
Conference
Source
2023 IEEE AFRICON, pp. 1-5, Sep. 2023
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Performance evaluation
Deep learning
Rockets
Power demand
Software algorithms
Table lookup
Frequency measurement
deep learning
convolution
FPGA
systolic array
hardware accelerator
Gemmini
Rocket core
Language
English
ISSN
2153-0033
Abstract
The successful use of deep learning (DL) algorithms in a variety of applications is conceptually based on convolutions. Although convolution is a simple operation, it suffers severe performance degradation when implemented in software. Recently, with the advancement of CMOS technology, the convolution operation in DL algorithms has been accelerated by delegating it to specialized hardware platforms such as Field Programmable Gate Array (FPGA) devices. On hardware platforms, the convolution operation can be implemented on a synthesizable processor core or on custom hardware accelerators based on systolic arrays (SAs). Choosing an optimal hardware implementation should not be done purely analytically but should instead employ tools for fast and accurate estimation of metrics such as execution cycles, hardware resource utilization, and power consumption. This work evaluates the efficiency of implementing the convolution operation on various SA dimensions (8 × 8, 16 × 16, and 32 × 32) of the open-source Gemmini DL hardware accelerator, with a comparison to the synthesizable RISC-V Rocket processor core. In terms of execution cycles, the 8 × 8, 16 × 16, and 32 × 32 Gemmini configurations offer speedups of 323×, 249×, and 204×, respectively, relative to the Rocket core. This work shows that, unlike General Matrix-to-Matrix Multiplication (GEMM), the performance of the convolution operation degrades by an average factor of 2 when the Gemmini SA dimension is doubled. In terms of hardware resource utilization on the Zynq UltraScale+ ZCU104 FPGA evaluation board, the area and power consumption increase by 3.1× and 2.7×, respectively, when the Gemmini SA dimension is doubled. Overall, the 8 × 8 Gemmini SA dimension recorded the highest performance-per-area metric, making it the most efficient for a popular convolution configuration.
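Taking the abstract's figures at face value, a back-of-the-envelope calculation illustrates why the 8 × 8 configuration comes out ahead on performance-per-area: normalising the 8 × 8 area to 1.0 (an assumption made only for illustration) and applying the reported ~3.1× area growth per doubling of the SA dimension yields relative performance-per-area values of roughly 323, 80, and 21 for the three configurations. A minimal C sketch of that arithmetic, not code from the paper itself:

```c
#include <stdio.h>

/* Rough check of the performance-per-area ranking using only numbers
 * quoted in the abstract: speedups over the Rocket core (323x, 249x, 204x)
 * and ~3.1x area growth each time the SA dimension doubles.
 * Normalising the 8x8 area to 1.0 is an assumption for illustration. */
int main(void) {
    const char  *dims[]    = {"8x8", "16x16", "32x32"};
    const double speedup[] = {323.0, 249.0, 204.0};  /* vs. Rocket core */
    const double area_growth = 3.1;                  /* per SA doubling */

    double area = 1.0;                               /* 8x8 area := 1.0 */
    for (int i = 0; i < 3; i++) {
        printf("%-6s speedup=%6.1f  rel. area=%6.2f  perf/area=%6.1f\n",
               dims[i], speedup[i], area, speedup[i] / area);
        area *= area_growth;                         /* next (doubled) SA */
    }
    return 0;
}
```

Under these assumptions the 16 × 16 and 32 × 32 configurations pay roughly 3× and 9.6× the area of the 8 × 8 array while delivering lower speedups, which is consistent with the abstract's conclusion that the 8 × 8 SA is the most efficient configuration.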