Journal Article

A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems II: Express Briefs, 71(3):1401-1405, Mar. 2024
Subject
Components, Circuits, Devices and Systems
Training
Artificial neural networks
Energy efficiency
Hardware
Random access memory
Deep learning
Clocks
Multiple-precision
floating-point
fixed-point
PE
MAC
HPC
Language
English
ISSN
1549-7747 (Print)
1558-3791 (Electronic)
Abstract
High-performance computing (HPC) can accelerate deep neural network (DNN) training and inference. Previous works have proposed multiple-precision floating- and fixed-point designs, but most handle only one of the two. This brief proposes a novel reconfigurable processing element (PE) supporting both energy-efficient floating-point and fixed-point multiply-accumulate (MAC) operations. In one clock cycle, the PE performs 9× BFloat16 (BF16), 4× half-precision (FP16), 4× TensorFloat-32 (TF32), or 1× single-precision (FP32) MAC operations with 100% multiplication-hardware utilization. It also supports a 72× INT2, 36× INT4, or 9× INT8 dot product plus one 32-bit addend. The design is realized in a 28-nm process at a 1.471 GHz slow-corner clock frequency. Compared with state-of-the-art (SOTA) multiple-precision PEs, the proposed work achieves the best energy efficiency, 834.35 GFLOPS/W at TF32 and 1761.41 GFLOPS/W at BF16, at least 10× and 4× improvements, respectively, for deep learning training. Meanwhile, the design supports energy-efficient fixed-point computing with small hardware overhead for deep learning inference.
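The per-cycle MAC counts quoted above are consistent with partitioning one FP32-sized (24×24) significand multiplier into smaller square tiles. Below is a minimal Python sketch of that arithmetic, together with a bit-accurate reference for the INTx dot product plus a 32-bit addend; the tiling scheme and the wrap-around two's-complement accumulator are illustrative assumptions, not the paper's circuit.

    # Reference model (Python), not the authors' RTL. The format widths are
    # the standard BF16/FP16/TF32/FP32 definitions; the 24x24 significand-
    # multiplier tiling is an assumption that reproduces the 9x/4x/4x/1x
    # ratios stated in the abstract.
    FORMATS = {
        # name: (exponent bits, explicit mantissa bits)
        "BF16": (8, 7),    # significand = 8 bits incl. hidden bit
        "FP16": (5, 10),   # significand = 11 bits incl. hidden bit
        "TF32": (8, 10),   # significand = 11 bits incl. hidden bit
        "FP32": (8, 23),   # significand = 24 bits incl. hidden bit
    }

    def mac_slots(fmt: str) -> int:
        """Parallel MACs when one 24x24 significand multiplier is split into
        the smallest square tiles (8, 12, or 24 bits) that fit the format."""
        _, mant = FORMATS[fmt]
        sig = mant + 1                                  # add the hidden bit
        tile = next(t for t in (8, 12, 24) if sig <= t)
        return (24 // tile) ** 2

    def int_dot_acc(a, b, addend, width=8):
        """n-way INTx dot product plus one addend, wrapped to a signed
        32-bit result (assumed two's-complement accumulator behavior)."""
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        assert all(lo <= x <= hi for x in a + b), "operands out of range"
        acc = addend + sum(x * y for x, y in zip(a, b))
        return ((acc + (1 << 31)) % (1 << 32)) - (1 << 31)

    for f in FORMATS:
        print(f, mac_slots(f))                  # BF16 9, FP16 4, TF32 4, FP32 1
    print(int_dot_acc(list(range(9)), [1] * 9, 100))   # 9-way INT8 dot: 136

The same 8×8 tiles that yield nine BF16 products per cycle would naturally carry the 9× INT8 dot product, with INT4 and INT2 subdividing each tile further (36× and 72×), which is one plausible reading of how the fixed-point modes reuse the floating-point multiplier array.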