Academic Paper

An Analysis of Resilience Techniques for Exascale Computing Platforms
Document Type
Conference
Source
2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 914-923, May 2017
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Signal Processing and Analysis
Resilience
Checkpointing
Benchmark testing
Redundancy
Computational modeling
Resource management
exascale resilience
checkpoint restart
multilevel checkpointing
message logging
fault tolerance
Language
Abstract
With the increase in the complexity and number of nodes in large-scale high performance computing (HPC) systems, the probability of applications experiencing failures has increased significantly. As the computational demands of applications that execute on HPC systems increase, projections indicate that applications executing on exascale-sized systems are likely to operate with a mean time between failures (MTBF) of as little as a few minutes. A number of strategies for enabling fault resilience in systems of extreme sizes have been proposed in recent years. However, few studies provide performance comparisons for these resilience techniques. This work provides a comparison of four state-of-the-art HPC resilience techniques that are being considered for use in exascale systems. We explore the behavior of each resilience technique under simulated execution of a diverse set of applications varying in communication behavior and memory use. We examine how each resilience technique behaves as application size scales from what is considered large today through to exascale-sized applications. We further study the performance degradation that a large-scale system experiences from the overhead associated with each resilience technique as well as the application computation needed to continue execution when a failure occurs. Using the results from these analyses, we examine how application performance on exascale systems can be improved by allowing the system to select the optimal resilience technique for use in an application-specific manner, depending upon each application's execution characteristics.
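As a rough illustration of why projected MTBF shrinks to minutes at exascale node counts, and how that affects checkpoint frequency, the sketch below assumes independent, exponentially distributed node failures and uses Young's first-order approximation for the checkpoint interval. Neither the model nor the numbers are taken from the paper; the function names and example values (a 10-year node MTBF, one million nodes, a 30-second checkpoint cost) are hypothetical.

```python
import math


def system_mtbf(node_mtbf: float, num_nodes: int) -> float:
    """System-level MTBF, assuming independent, exponentially
    distributed node failures (an assumption, not the paper's model).
    The result is in the same time unit as node_mtbf."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf / num_nodes


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint_cost * mtbf). Both arguments must use the
    same time unit."""
    if checkpoint_cost < 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost must be >= 0 and mtbf > 0")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical inputs, all expressed in minutes.
    node_mtbf_min = 10 * 365 * 24 * 60   # 10-year MTBF per node
    nodes = 1_000_000                    # exascale-sized node count
    checkpoint_cost_min = 0.5            # 30 seconds per checkpoint

    mtbf = system_mtbf(node_mtbf_min, nodes)
    interval = young_interval(checkpoint_cost_min, mtbf)
    print(f"system MTBF: {mtbf:.2f} min")
    print(f"checkpoint interval (Young): {interval:.2f} min")
```

With these hypothetical inputs the system MTBF comes out at about five minutes and the approximate checkpoint interval at roughly 2.3 minutes, which is in line with the "few minutes" scale mentioned in the abstract. Young's approximation is only reasonable when the checkpoint cost is small relative to the MTBF.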