Academic Paper

Towards Efficient Cache Allocation for High-Frequency Checkpointing
Document Type
Conference
Source
2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 262-271, Dec. 2022
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Checkpointing
Costs
Runtime
Art
Memory management
Graphics processing units
Pins
GPU checkpointing
multi-level caching
fast initialization
Language
English
ISSN
2640-0316
Abstract
While many HPC applications are known to have long runtimes, this is not always because of single large runs: in many cases it is due to ensembles composed of many short runs (with runtimes on the order of minutes). When each such run needs to checkpoint frequently (e.g., adjoint computations using a checkpoint interval on the order of milliseconds), it is important to minimize both the checkpointing overhead at each iteration and the initialization overhead. With the rising popularity of GPUs, minimizing both overheads simultaneously is challenging: while it is possible to take advantage of efficient asynchronous data transfers between GPU and host memory, this comes at the cost of the high initialization overhead needed to allocate and pin host memory. In this paper, we contribute an efficient technique to address this challenge. The key idea is an adaptive approach that delays pinning the host memory buffer holding the checkpoints until all memory pages have been touched, which greatly reduces the overhead of registering the host memory with the CUDA driver. To this end, we use a combination of asynchronous touching of memory pages and direct writes of checkpoints to untouched and touched memory pages, guided by performance modeling, in order to minimize end-to-end checkpointing overheads. Our evaluations show a significant improvement over a variety of alternative static allocation strategies and state-of-the-art approaches.