Academic Paper

Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

Document Type

Academic Journal

Author

Canal, Ramon; Hernandez, Carles; Tornero, Rafa; Cilardo, Alessandro; Massari, Giuseppe; Reghenzani, Federico; Fornaciari, William; Zapater, Marina; Atienza, David; Oleksiak, Ariel; Piątek, Wojciech; Abella, Jaume

Source

ACM Computing Surveys (CSUR). 53(5):1-32

Subject

HPC
exascale
failures
faults
prediction
reliability
supercomputing
survey

Language

English

ISSN

0360-0300
1557-7341

Abstract

Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.
