
SDC is in the Eye of the Beholder: A Survey and Preliminary Study
2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 72-76, June 2016
Communication, Networking and Broadcast Technologies
Computing and Processing
Fault tolerance
Fault tolerant systems
Operating systems
Computer crashes
Electronic mail
Silent Data Corruption (SDC)
Application-specific Correctness
Silent data corruptions (SDCs) are among the most critical issues in modern HPC systems: they are "silent" by definition and give users and application developers no warning that a calculation has been corrupted. Significant effort has gone into characterizing, detecting, and tolerating SDCs. However, current approaches do not share a common understanding of what an SDC is, which makes it difficult both to evaluate their effectiveness and to compare them with one another. This position paper argues that SDCs should be discussed at each layer of the system and scoped to the goal of the approach in question. We provide a preliminary result that differentiates data corruptions across system layers, and show that application-specific correctness checks can tolerate about 50% of the errors that appear in the application output.