Journal Article

HPC Process and Optimal Network Device Affinitization
Document Type
Periodical
Source
IEEE Transactions on Multi-Scale Computing Systems, 4(4):749-757, Jan. 2018
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Components, Circuits, Devices and Systems
Performance evaluation
High performance computing
Bandwidth
Programming
Benchmark testing
Fabric
high performance computing
infiniband
MPI
NUMA
OFI
performance
process affinity
topology
Language
English
ISSN
2332-7766
2372-207X
Abstract
High Performance Computing (HPC) applications place demanding requirements on hardware resources such as processors, memory, and storage. Applications in Artificial Intelligence and Machine Learning are taking center stage in HPC, driving demand for more compute resources per node, which in turn raises the bandwidth requirement between compute nodes. New system design paradigms exist in which deploying more than one high-performance IO device per node provides benefits. The number of IO devices connected to an HPC node can be increased with PCIe switches, and hence some HPC nodes are designed to include PCIe switches that provide a large number of PCIe slots. With multiple IO devices per node, application programmers must consider HPC process affinity not only to compute resources but also to IO devices. Mapping processes to processor cores and the closest IO device(s) increases complexity because of the three-way mapping and the variety of HPC node architectures. While operating systems perform reasonable mapping of processes to processor core(s), they lack the application developer's knowledge of process workflow and of optimal IO resource allocation when more than one IO device is attached to the compute node. This paper is an extended version of our work published in [1]. Our previous work provided a solution for IO device affinity choices by abstracting the device selection algorithm from HPC applications. In this paper, we extend the affinity solution to OpenFabrics Interfaces (OFI), a generic HPC API developed as part of the OpenFabrics Alliance that supports a wider range of HPC programming models and applications across various HPC fabric vendors. We present a solution for IO device affinity choices by abstracting the device selection algorithm from HPC applications. MPI continues to be the dominant programming model for HPC, and hence we evaluate our solution with MPI-based micro-benchmarks. The solution is then extended to OFI, which supports other HPC programming models such as SHMEM, GASNet, and UPC. We propose solving NUMA issues at the lower level of the software stack that forms the runtime for MPI and other programming models, independent of HPC applications. Our experiments are conducted on a two-node system in which each node is a two-socket Intel Xeon server attached to up to four Intel Omni-Path fabric devices connected over PCIe. The performance benefit of affinitizing processes with the best possible network device is evident from the results: with the OSU benchmark suite we observe up to 40 percent improvement in uni-directional bandwidth, 48 percent in bi-directional bandwidth, 32 percent in latency, and up to 40 percent in message rate. We also extend our evaluation to OFI operations and an MPI benchmark used for genome assembly. With OFI Remote Memory Access (RMA) operations we see bandwidth improvements of 32 percent for fi_read and 22 percent for fi_write, and latency improvements of 15 percent for fi_read and 14 percent for fi_write. The K-mer Matching Interface HASH benchmark shows an improvement of up to 25 percent when using a local network device versus a network device connected to the remote Xeon socket.
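
Illustrative note: the abstract describes selecting, for each process, the network device closest to the cores it runs on, with the selection logic hidden below the MPI/OFI runtime rather than in the application. The authors' implementation is not reproduced here; the minimal C sketch below only illustrates the general idea under stated assumptions. It enumerates fabric domains with libfabric's fi_getinfo() and prefers a domain whose underlying device reports the same NUMA node as the calling process. The sysfs path used for device locality and the assumption that a domain name maps directly to a device directory are hypothetical and vary by provider and driver; they are not taken from the paper.

    /*
     * Minimal sketch (not the authors' implementation): prefer a libfabric
     * domain whose device is NUMA-local to the calling process.
     * Build with: gcc -D_GNU_SOURCE sketch.c -lfabric -lnuma
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>          /* sched_getcpu() */
    #include <numa.h>           /* numa_node_of_cpu() */
    #include <rdma/fabric.h>    /* libfabric core API */

    /* Read a device's NUMA node from sysfs; returns -1 if unknown.
     * The path layout below (RDMA-style device directory named after the
     * OFI domain) is an assumption for illustration only. */
    static int device_numa_node(const char *domain_name)
    {
        char path[256];
        int node = -1;
        snprintf(path, sizeof(path),
                 "/sys/class/infiniband/%s/device/numa_node", domain_name);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &node) != 1)
                node = -1;
            fclose(f);
        }
        return node;
    }

    int main(void)
    {
        int my_node = (numa_available() < 0) ? -1
                                             : numa_node_of_cpu(sched_getcpu());
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL, *cur, *best = NULL;

        hints->ep_attr->type = FI_EP_RDM;   /* reliable datagram endpoints */
        hints->caps = FI_MSG | FI_RMA;      /* messaging and RMA capabilities */

        if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) == 0) {
            for (cur = info; cur; cur = cur->next) {
                /* Prefer a domain whose device sits on this process's NUMA node. */
                if (device_numa_node(cur->domain_attr->name) == my_node) {
                    best = cur;
                    break;
                }
            }
            if (!best)
                best = info;                /* fall back to the first domain */
            printf("process on NUMA node %d -> domain %s\n",
                   my_node, best->domain_attr->name);
            fi_freeinfo(info);
        }
        fi_freeinfo(hints);
        return 0;
    }

In a real runtime this choice would be made once per rank during endpoint creation, so applications written to MPI, SHMEM, GASNet, or UPC inherit the NUMA-aware device selection without any source changes, which is the property the abstract emphasizes.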