학술논문

A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor With Dynamic Structured Sparsity

Document Type

Periodical

Author

Venkataramanaiah, S.K.; Meng, J.; Suh, H.; Yeo, I.; Saikia, J.; Cherupally, S.K.; Zhang, Y.; Zhang, Z.; Seo, J.

Source

IEEE Journal of Solid-State Circuits IEEE J. Solid-State Circuits Solid-State Circuits, IEEE Journal of. 58(7):1885-1897 Jul, 2023

Subject

Components, Circuits, Devices and Systems
Engineered Materials, Dielectrics and Plasmas
Computing and Processing
Training
Hardware
Energy efficiency
Convolution
Random access memory
Memory management
Convolutional neural networks
Activation sparsity
CNNs
convolutional neural network (CNN) training processor
energy-efficient accelerator
weight sparsity (WS)

Language

ISSN

0018-9200
1558-173X

Abstract

Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training the processor which implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process; 2) hardware-efficient channel gating for dynamic output activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point operations (FLOPs) reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.

Online Access

Full Text (IEEE) Web of Science JCR 저널정보 Scopus Find it@PNU

이메일

부산대학교 도서관

Online Access

메일 발송