Journal Article

A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor With Dynamic Structured Sparsity
Document Type
Periodical
Source
IEEE Journal of Solid-State Circuits, 58(7):1885-1897, Jul. 2023
Subject
Components, Circuits, Devices and Systems
Engineered Materials, Dielectrics and Plasmas
Computing and Processing
Training
Hardware
Energy efficiency
Convolution
Random access memory
Memory management
Convolutional neural networks
Activation sparsity
CNNs
convolutional neural network (CNN) training processor
energy-efficient accelerator
weight sparsity (WS)
Language
English
ISSN
0018-9200 (Print)
1558-173X (Electronic)
Abstract
Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training processor that implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process; 2) hardware-efficient channel gating for dynamic output activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point operations (FLOPs) reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised and self-supervised training tasks.
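The weight sparsity mechanism described in the abstract is based on group Lasso, which penalizes the L2 norm of whole weight groups so that entire structures (e.g., output channels) are driven to zero rather than individual weights. The sketch below illustrates the general idea only; it is not the paper's hardware implementation, and the channel-wise grouping, the functions group_lasso_penalty and prune_channels, and the lam/threshold values are all hypothetical choices for this example.

```python
import numpy as np

def group_lasso_penalty(weights, lam=1e-4):
    """Group-Lasso regularizer over the output channels of a conv layer.

    weights: array of shape (C_out, C_in, kH, kW). Each output channel
    forms one group; the L2-within / L1-across structure pushes whole
    channels toward zero, yielding structured weight sparsity.
    """
    # L2 norm of each output-channel group, then L1 sum across groups.
    group_norms = np.sqrt((weights ** 2).reshape(weights.shape[0], -1).sum(axis=1))
    return lam * group_norms.sum()

def prune_channels(weights, threshold=1e-3):
    """Zero out output channels whose group norm fell below threshold."""
    norms = np.sqrt((weights ** 2).reshape(weights.shape[0], -1).sum(axis=1))
    mask = norms >= threshold
    # Broadcast the channel mask over (C_in, kH, kW).
    return weights * mask[:, None, None, None], mask

# Example: 16 output channels of 3x3 kernels over 8 input channels.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(16, 8, 3, 3))
penalty = group_lasso_penalty(W)
W_pruned, kept = prune_channels(W)
print(penalty, int(kept.sum()), "of", kept.size, "channels kept")
```

In training, such a penalty would be added to the task loss so that channel norms shrink over epochs; pruned channels can then be skipped entirely, which is what makes the sparsity "structured" and hardware-friendly compared with unstructured element-wise pruning.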