Academic Paper

High-Performance Architecture Using Fast Dynamic Reconfigurable Accelerators
Document Type
Periodical
Source
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IEEE Trans. VLSI Syst.). 26(7):1209-1222 Jul, 2018
Subject
Components, Circuits, Devices and Systems
Computing and Processing
Field programmable gate arrays
Computer architecture
Transistors
Logic gates
Acceleration
Task analysis
Accelerator (ACC)
dynamic reconfiguration
field-programmable gate array (FPGA)
gem5
monolithic 3-D (3Dm)
platform for ACC-rich architectural design and exploration (PARADE)
reconfigurable computing
vertical slit FET (VeSFET)
Language
ISSN
1063-8210
1557-9999
Abstract
System accelerators (ACCs) improve performance and break power and utilization walls. They can be implemented as fixed-function hard macros or in reconfigurable logic such as field-programmable gate arrays (FPGAs). For systems running various applications, dynamic reconfigurable ACCs offer a very attractive feature; however, the reconfiguration time is an unavoidable overhead. This paper proposes a high-performance architecture with fast dynamic reconfigurable FPGA ACCs (F-RACCs) based on a novel bitstream reprogramming method, which is made feasible by emerging technologies. The architecture includes CPU cores, caches, memories, ACCs, and networks-on-chip. A portion of the computing tasks can be offloaded from the CPUs to the ACCs to improve performance. The ACCs can be reprogrammed rapidly to accommodate the various functions required by a wide spectrum of applications. Performance is evaluated with the platform for ACC-rich architectural design and exploration (PARADE), a gem5-based cycle-accurate full-system simulation platform. Eleven benchmark applications from different domains are evaluated. Compared with systems using conventional FPGA ACCs that are partially reconfigured at the fastest configuration speed, this architecture improves system performance on all applications and achieves maximum speedups of $1.31\times$ and $2.82\times$ using 1 and 12 ACC instances, respectively. Over the CPU software path with no ACCs, it achieves maximum speedups of $94.93\times$ with one F-RACC and $565.12\times$ with 12 F-RACCs.
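
To illustrate why reconfiguration time matters for offloaded tasks, the following is a minimal Amdahl-style sketch. It is not the model used in the paper: the function `offload_speedup`, its parameters, and the numbers in the usage lines are all hypothetical, and the sketch assumes a single accelerator whose reconfiguration cost is paid once and is not overlapped with computation.

```python
def offload_speedup(t_cpu, offload_fraction, acc_speedup, t_reconfig=0.0):
    """Estimate end-to-end speedup when part of a task runs on an accelerator.

    All parameters are illustrative assumptions, not values from the paper:
      t_cpu            -- run time of the whole task on the CPU alone
      offload_fraction -- share of t_cpu that can be moved to the accelerator (0..1)
      acc_speedup      -- how much faster the accelerator runs the offloaded part
      t_reconfig       -- time spent reconfiguring the accelerator (same unit as t_cpu)
    """
    if t_cpu <= 0 or acc_speedup <= 0 or t_reconfig < 0:
        raise ValueError("t_cpu and acc_speedup must be positive, t_reconfig non-negative")
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")

    t_on_cpu = t_cpu * (1.0 - offload_fraction)       # part that stays on the CPU
    t_on_acc = t_cpu * offload_fraction / acc_speedup  # part that runs on the accelerator
    return t_cpu / (t_on_cpu + t_on_acc + t_reconfig)


# Hypothetical numbers: the same workload with slow and fast reconfiguration.
slow = offload_speedup(t_cpu=100.0, offload_fraction=0.9, acc_speedup=10.0, t_reconfig=10.0)
fast = offload_speedup(t_cpu=100.0, offload_fraction=0.9, acc_speedup=10.0, t_reconfig=1.0)
print(f"slow reconfiguration: {slow:.2f}x, fast reconfiguration: {fast:.2f}x")
```

Under these assumptions, shrinking `t_reconfig` raises the achievable speedup even when the accelerator itself is unchanged, which is the effect the abstract attributes to faster reprogramming.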