Academic Paper

The Self-adaptive and Topology-aware MPI_Bcast leveraging Collective offload on Tianhe Express Interconnect
Document Type
Conference
Source
2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 791-801, May 2024
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Distributed processing
Multicast algorithms
Semantics
Switches
Reliability engineering
Data transfer
Hardware
collectives
broadcast
NIC-based offload
Language
ISSN
1530-2075
Abstract
Large parallel applications rely heavily on MPI (Message Passing Interface) collectives, which support portable and efficient group communication operations. MPI_Bcast is one of the most commonly used MPI collectives; it broadcasts data to all processes of the communication domain. However, traditional software-based broadcast algorithms fail to fully utilize the advanced features of modern interconnection networks, such as offloading collectives to the network hardware for efficient group communication. Moreover, the semantic gap between MPI_Bcast and the hardware multicast of the underlying interconnects makes it challenging for offload-based algorithms to accelerate MPI_Bcast across a wide range of message sizes. In this paper, we propose a hardware-software co-designed MPI_Bcast that efficiently leverages the NIC-based collective offload provided by the Tianhe-Express interconnect, which eliminates CPU involvement to accelerate message broadcast. We detail this broadcast mechanism, which can be adaptively tuned to offload MPI_Bcast operations from the CPU to the NIC for various message and system sizes. In addition, we propose a topology-aware broadcast design that works in conjunction with this offload method to significantly reduce broadcast latency by constructing the optimal global inter-node communication tree. We implement and evaluate the proposed Tianhe-Express Offload-based Broadcast (TOB) design on the Tianhe-2A and Tianhe-EP supercomputers. Extensive experiments evaluate TOB performance at both the microbenchmark and application levels. Our solution achieves a speedup of up to 4.94x at the microbenchmark level over state-of-the-art MPI libraries. At the application level, our technique accelerates scientific applications by up to 1.34x.
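For context on the software-based baseline the abstract refers to, the following is a minimal sketch of a binomial-tree broadcast schedule, one common textbook way to implement a broadcast in software. It is not the TOB design and is not taken from the paper; the function name `binomial_broadcast_schedule` and its parameters are hypothetical, and the code only simulates which rank would send to which rank in each round rather than calling any MPI library.

```python
def binomial_broadcast_schedule(num_procs, root=0):
    """Simulate a generic binomial-tree broadcast (illustrative only, not TOB).

    Returns a list of rounds; each round is a list of (sender, receiver)
    rank pairs. Ranks are rotated so that `root` acts as relative rank 0.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    if not 0 <= root < num_procs:
        raise ValueError("root must be a valid rank")

    rounds = []
    holders = 1  # relative ranks 0 .. holders-1 already hold the data
    while holders < num_procs:
        sends = []
        for rel in range(holders):
            peer = rel + holders  # each holder forwards to one new rank
            if peer < num_procs:
                sender = (rel + root) % num_procs
                receiver = (peer + root) % num_procs
                sends.append((sender, receiver))
        rounds.append(sends)
        holders *= 2  # the set of ranks holding the data doubles each round
    return rounds


if __name__ == "__main__":
    for step, sends in enumerate(binomial_broadcast_schedule(8), start=1):
        print(f"round {step}: {sends}")
```

In this sketch every forwarding step is issued by a process that already holds the data, so a broadcast over P processes takes about log2(P) rounds of host-driven sends. That per-round CPU involvement is the kind of cost a NIC-based offload aims to avoid.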