Academic paper

On integrating error detection into a fault diagnosis algorithm for massively parallel computers
Document Type
Conference
Source
Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium, pp. 154-164, 1995
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Computer errors
Fault detection
Fault diagnosis
Concurrent computing
Fault tolerant systems
Scalability
Clustering algorithms
Hardware
Application software
Instruments
Abstract
Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large, massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. We introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both message-based error detection techniques and built-in hardware mechanisms. The structure of the implemented algorithm is presented and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the message-based error detection mechanism, on application performance.