Journal Article

DQ-STP: An Efficient Sparse On-Device Training Processor Based on Low-Rank Decomposition and Quantization for DNN
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems I: Regular Papers, 71(4):1665-1678, Apr. 2024
Subject
Components, Circuits, Devices and Systems
Training
Tensors
Artificial neural networks
Hardware
Quantization (signal)
Throughput
Energy efficiency
Deep neural network
weight low-rank decomposition
quantization
sparsity exploitation
on-device training processor
Language
English
ISSN
1549-8328
1558-0806
Abstract
Due to bottlenecks such as scenario-varying applications, significant data communication overhead, and privacy protection between off-line training and on-line inference, intelligent edge devices capable of adaptively fine-tuning deep neural network (DNN) models for specific tasks have become an urgent need. However, the computational cost of ordinary on-device training (ODT) is intolerable, which motivates us to explore an efficient ODT processor, named DQ-STP. In this paper, we leverage a series of optimization techniques through software-hardware co-design. On the software side, the proposed design incorporates SVD-based low-rank decomposition, $2^{n}$ quantization, and the ACBN algorithm, which unify the sparse computing mode of convolutional layers and enhance weight sparsity. On the hardware side, the proposed design effectively exploits data sparsity through four techniques: 1) a flag compressed sparse row format is proposed to compress input feature maps and gradient maps; 2) a unified processing element (PE) array comprising shifters and adders is proposed to accelerate the forward and error propagation steps; 3) the PE arrays for error propagation and weight-gradient generation are separated to enhance throughput; and 4) a sparse alignment strategy is proposed to further improve PE utilization. Through this software-hardware co-optimization, the proposed DQ-STP achieves an area efficiency of 41.2 GOPS/mm$^{2}$ and a peak energy efficiency of 90.63 TOPS/W. Compared with state-of-the-art reference designs, the proposed DQ-STP demonstrates a $2.19\times$ improvement in normalized area efficiency and a $1.85\times$ enhancement in energy efficiency.
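
Illustrative note: a minimal NumPy sketch of the software-side idea summarized above, combining a truncated-SVD low-rank factorization of a weight matrix with power-of-two ($2^{n}$) quantization so that multiplications can be replaced by shifts and adds. The rank r, the exponent range, and the quantize_pow2 helper are assumptions made for illustration and not the paper's implementation; the ACBN step is omitted.

import numpy as np

def low_rank_decompose(W, r):
    """Return rank-r factors (A, B) such that W is approximated by A @ B via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

def quantize_pow2(x, min_exp=-8, max_exp=0):
    """Quantize each entry to a signed power of two (nearest exponent); tiny values go to zero."""
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.clip(np.round(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)
    q = sign * np.exp2(exp)
    return np.where(mag < 2.0 ** (min_exp - 1), 0.0, q)   # zeroing small values adds sparsity

W = np.random.randn(64, 64) * 0.1
A, B = low_rank_decompose(W, r=8)
W_hat = quantize_pow2(A) @ quantize_pow2(B)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))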
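
Similarly, a hedged sketch of flag-style row compression for sparse feature maps and gradient maps. The paper's flag compressed sparse row format is not detailed in the abstract, so this bitmap-plus-nonzero-values layout (compress_flag_rows / decompress_flag_rows) is only an assumed illustration of the general concept.

import numpy as np

def compress_flag_rows(fmap):
    """Compress a 2-D map row by row into (per-row flag bits, nonzero values)."""
    flags, values = [], []
    for row in fmap:
        mask = row != 0
        flags.append(np.packbits(mask.astype(np.uint8)))  # 1 flag bit per element
        values.append(row[mask])                          # only nonzeros are stored
    return flags, values

def decompress_flag_rows(flags, values, width):
    """Rebuild the dense map by scattering nonzeros back to flagged positions."""
    rows = []
    for f, v in zip(flags, values):
        mask = np.unpackbits(f)[:width].astype(bool)
        row = np.zeros(width, dtype=fmap_dtype := np.float64)
        row[mask] = v
        rows.append(row)
    return np.stack(rows)

fmap = np.where(np.random.rand(4, 16) > 0.7, np.random.randn(4, 16), 0.0)
flags, values = compress_flag_rows(fmap)
assert np.allclose(decompress_flag_rows(flags, values, 16), fmap)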