Academic Article

The Efficiency of Convolution on Gemmini Deep Learning Hardware Accelerator
Document Type
Conference
Source
2023 IEEE AFRICON, pp. 1-5, Sep. 2023
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Performance evaluation
Deep learning
Rockets
Power demand
Software algorithms
Table lookup
Frequency measurement
deep learning
convolution
FPGA
systolic array
hardware accelerator
Gemmini
Rocket core
Language
English
ISSN
2153-0033
Abstract
The successful use of deep learning (DL) algorithms in a variety of applications is conceptually based on convolutions. Although convolution is a simple operation, it suffers severe performance degradation when implemented in software. Recently, with the advancement of CMOS technology, the convolution operation in DL algorithms has been accelerated by delegating it to specialized hardware platforms such as Field Programmable Gate Array (FPGA) devices. On hardware platforms, the convolution operation can be implemented on a synthesizable processor core or on custom hardware accelerators based on systolic arrays (SAs). Choosing an optimal hardware implementation should not be done purely analytically but should instead employ tools for fast and accurate estimation of metrics such as execution cycles, hardware resource utilization, and power consumption. This work evaluates the efficiency of implementing the convolution operation on various SA dimensions (8 × 8, 16 × 16, and 32 × 32) of the open-source Gemmini DL hardware accelerator, with a comparison to the synthesizable RISC-V Rocket processor core. In terms of execution cycles, the 8 × 8, 16 × 16, and 32 × 32 Gemmini configurations offer speedups of 323×, 249×, and 204×, respectively, relative to the Rocket core. This work shows that, unlike General Matrix-to-Matrix Multiplication (GEMM), the performance of the convolution operation degrades by an average factor of 2 when the Gemmini SA dimension is doubled. In terms of hardware resource utilization on the Zynq UltraScale+ ZCU104 FPGA evaluation board, the area and power consumption increase by 3.1× and 2.7×, respectively, when the Gemmini SA dimension is doubled. Overall, the 8 × 8 Gemmini SA dimension recorded the highest performance-per-area metric, making it the most efficient for a popular convolution configuration.
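Taking the abstract's figures at face value, a back-of-the-envelope calculation illustrates why the 8 × 8 configuration comes out ahead on performance-per-area: normalising the 8 × 8 area to 1.0 (an assumption made only for illustration) and applying the reported ~3.1× area growth per doubling of the SA dimension yields relative performance-per-area values of roughly 323, 80, and 21 for the three configurations. A minimal C sketch of that arithmetic, not code from the paper itself:

```c
#include <stdio.h>

/* Rough check of the performance-per-area ranking using only numbers
 * quoted in the abstract: speedups over the Rocket core (323x, 249x, 204x)
 * and ~3.1x area growth each time the SA dimension doubles.
 * Normalising the 8x8 area to 1.0 is an assumption for illustration. */
int main(void) {
    const char  *dims[]    = {"8x8", "16x16", "32x32"};
    const double speedup[] = {323.0, 249.0, 204.0};  /* vs. Rocket core */
    const double area_growth = 3.1;                  /* per SA doubling */

    double area = 1.0;                               /* 8x8 area := 1.0 */
    for (int i = 0; i < 3; i++) {
        printf("%-6s speedup=%6.1f  rel. area=%6.2f  perf/area=%6.1f\n",
               dims[i], speedup[i], area, speedup[i] / area);
        area *= area_growth;                         /* next (doubled) SA */
    }
    return 0;
}
```

Under these assumptions the 16 × 16 and 32 × 32 configurations pay roughly 3× and 9.6× the area of the 8 × 8 array while delivering lower speedups, which is consistent with the abstract's conclusion that the 8 × 8 SA is the most efficient configuration.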