Academic paper

Bs-SpMM: Accelerate Sparse Matrix-Matrix Multiplication by Balanced Split Strategy on the GPU
Document Type
Conference
Source
IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1-6, May 2023
Subject
Communication, Networking and Broadcast Technologies
Instruction sets
Conferences
Memory management
Graphics processing units
System-on-chip
Sparse matrices
Task analysis
Sparse matrix
GPU
balanced split strategy
SpMM
Language
English
ISSN
2833-0587
Abstract
Sparse Matrix-Matrix Multiplication (SpMM) is a commonly used operation in many domains, particularly in the increasingly popular Graph Neural Network (GNN) frameworks. Current GPU-based SpMM kernels mainly adopt a row-splitting strategy, in which one warp on the GPU processes one row of the sparse matrix. However, because non-zero elements are unevenly distributed across the rows of a sparse matrix, this row-splitting method often suffers from two problems: (1) load imbalance among warps, and (2) low parallel efficiency within a warp when processing long rows. We propose Bs-SpMM, an SpMM algorithm based on a balanced split strategy. When storing the sparse matrix, we use a "part" rather than a "row" as the unit of granularity, just as int8 is more efficient than int16 when the smaller type suffices. This guarantees that the size of a basic work task stays within a fixed range and that a warp has enough threads to process all non-zero elements of a part in parallel. When computing on the GPU, parts are processed in a balanced way, avoiding the large row-to-row variation in the number of non-zero elements; this naturally alleviates the load imbalance problem among warps. For memory transactions on the GPU, the temporary space required by a part is small enough to always fit in on-chip shared memory, which ensures better data locality. Since all non-zero elements in a part come from the same row, coalesced memory access to the dense matrix is preserved. On an NVIDIA RTX 3070 GPU, experimental results on the SNAP Matrix Collection show that Bs-SpMM achieves average speedups of 1.40x and 1.49x over NVIDIA cuSPARSE and the state-of-the-art GE-SpMM, respectively.
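The abstract gives no code, but the part-based splitting it describes can be sketched roughly as follows. This is a minimal host-side C++ illustration under the assumption that a "part" is a fixed-size slice of one CSR row's non-zeros; the names Part, PART_SIZE, and split_rows_into_parts are hypothetical and not taken from the paper.

#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed bound on non-zeros per part, e.g. one warp's width on the GPU.
constexpr int32_t PART_SIZE = 32;

// A "part": a contiguous slice of one row's non-zeros in CSR storage.
struct Part {
    int32_t row;        // source row of every non-zero in this part
    int32_t nnz_begin;  // index of the first non-zero in the CSR col/val arrays
    int32_t nnz_count;  // number of non-zeros in this part (<= PART_SIZE)
};

// Split each CSR row into parts holding at most PART_SIZE non-zeros,
// so a long row becomes several equally bounded work units.
// row_ptr is the CSR row-pointer array of length num_rows + 1.
std::vector<Part> split_rows_into_parts(const std::vector<int32_t>& row_ptr) {
    std::vector<Part> parts;
    const int32_t num_rows = static_cast<int32_t>(row_ptr.size()) - 1;
    for (int32_t r = 0; r < num_rows; ++r) {
        for (int32_t start = row_ptr[r]; start < row_ptr[r + 1]; start += PART_SIZE) {
            const int32_t count = std::min(PART_SIZE, row_ptr[r + 1] - start);
            parts.push_back({r, start, count});
        }
    }
    return parts;  // one part can then be assigned to one warp on the GPU
}

Because every part is bounded by the same PART_SIZE, assigning one part per warp gives each warp roughly equal work, and a part's intermediate results are small enough to fit in shared memory, which is the property the abstract relies on.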