학술논문

VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations
Document Type
Conference
Source
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) HPCA High-Performance Computer Architecture (HPCA), 2021 IEEE International Symposium on. :921-934 Feb, 2021
Subject
Components, Circuits, Devices and Systems
Computing and Processing
Histograms
High performance computing
Memory management
Linear algebra
Hardware
Libraries
Sparse matrices
Sparse Algebra
Vector Computing
Scratchpad Memory
Language
ISSN
2378-203X
Abstract
Sparse matrix operations are critical kernels in multiple application domains such as High Performance Computing, artificial intelligence and big data. Vector processing is widely used to improve performance on mathematical kernels with dense matrices. Unfortunately, existing vector architectures do not cope well with sparse matrix computations, achieving much lower performance in comparison with their dense counterparts.To overcome this limitation, we present the Vector Indexed Architecture (VIA), a novel hardware vector architecture that accelerates applications with irregular memory access patterns such as sparse matrix computations. There are two main bottlenecks when computing with sparse matrices: irregular memory accesses and index matching. VIA addresses these two bottlenecks with a smart scratchpad that is tightly coupled to the Vector Functional Units within the core.Thanks to this structure, VIA improves locality for sparse-dense computations and improves the index matching search process for sparse computations. As a result, VIA achieves significant performance speedup over highly optimized state-of-the-art C++ algebra libraries. On average, VIA outperforms sparse matrix vector multiplication, sparse matrix addition and sparse matrix matrix multiplication kernels by 4.22 ×, 6.14 × and 6.00 ×, respectively, when evaluated over a thousand sparse matrices that arise in real applications. In addition, we prove the generality of VIA by showing that it can accelerate histogram and stencil applications by 4.5 × and 3.5 ×, respectively.