Academic Paper

HDSuper: High-Quality and High Computational Utilization Edge Super-Resolution Accelerator With Hardware-Algorithm Co-Design Techniques
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems I: Regular Papers (IEEE Trans. Circuits Syst. I), 71(4):1679-1692, Apr. 2024
Subject
Components, Circuits, Devices and Systems
Superresolution
Hardware
Feature extraction
Convolution
Image reconstruction
Inference algorithms
Computational efficiency
Super-resolution
co-design
efficient mapping
high-quality image
ASIC
FPGA
Language
ISSN
1549-8328
1558-0806
Abstract
Super-resolution (SR) techniques construct high-definition images from low-quality images, and various neural networks have demonstrated excellent image-reconstruction quality in SR accelerators. However, deploying SR networks on edge devices is limited by the resource and power costs of large parameter counts, high computational complexity, and frequent external memory accesses. This work explores hardware-algorithm co-design techniques to provide an end-to-end platform comprising a lightweight super-resolution network (LSR) and an efficient, high-quality SR accelerator, HDSuper. On the algorithm side, improved depth-wise separable convolution and pixel-shuffle layers are developed with the hardware constraints in mind to reduce network size and computational complexity, and improved channel attention (CA) blocks enhance image-reconstruction quality. On the hardware side, a unified computing core (UCC) is combined with an efficient flattening-and-allocation (F-A) mapping strategy to support various operators with high computational utilization, and a patch computing scheme reduces the architecture's external memory accesses. In evaluation, the proposed algorithm achieves high-quality image reconstruction with a PSNR of 37.44 dB. Finally, an FPGA demonstration and an ASIC layout in UMC 55 nm are realized with low power consumption (2.08 W and 152 mW, respectively) while using the lowest hardware resources among the compared state-of-the-art works.
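For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of their standard textbook forms: the weight count of a depth-wise separable convolution versus a standard convolution, the pixel-shuffle rearrangement, and the PSNR metric. It does not reproduce the paper's "improved" variants, which the abstract does not describe. All function names are hypothetical, biases are ignored in the weight counts, and the pixel-shuffle channel ordering is an assumed common convention.

```python
import math


def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a k x k depth-wise conv plus a 1 x 1 point-wise conv."""
    return c_in * k * k + c_in * c_out


def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed convention: input channel c*r*r + i*r + j supplies the output
    pixel at row h*r + i, column w*r + j of output channel c.
    """
    if len(x) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = len(x) // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


def psnr(reference, reconstructed, peak=255.0):
    """PSNR in dB between two equal-length flat sequences of pixel values."""
    if len(reference) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

As an illustration, with 64 input channels, 64 output channels, and a 3 x 3 kernel, the standard layer has 36864 weights while the depth-wise separable form has 4672. That reduction is the kind that makes such layers attractive on resource-limited edge hardware.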