학술논문
A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor With Dynamic Structured Sparsity
Document Type
Periodical
Author
Source
IEEE Journal of Solid-State Circuits IEEE J. Solid-State Circuits Solid-State Circuits, IEEE Journal of. 58(7):1885-1897 Jul, 2023
Subject
Language
ISSN
0018-9200
1558-173X
1558-173X
Abstract
Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training the processor which implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process; 2) hardware-efficient channel gating for dynamic output activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point operations (FLOPs) reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.