학술논문

Pluribus—An operational fault-tolerant multiprocessor
Document Type
Periodical
Source
Proceedings of the IEEE Proc. IEEE Proceedings of the IEEE. 66(10):1146-1159 Oct, 1978
Subject
General Topics for Engineers
Engineering Profession
Aerospace
Bioengineering
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Fields, Waves and Electromagnetics
Geoscience
Nuclear Engineering
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Power, Energy and Industry Applications
Communication, Networking and Broadcast Technologies
Photonics and Electrooptics
Fault tolerance
ARPANET
Error correction
Fault tolerant systems
Fasteners
Hardware
Multiprocessing systems
Light emitting diodes
Microcomputers
Language
ISSN
0018-9219
1558-2256
Abstract
The authors describe the Pluribus multiprocessor system, outline several techniques used to achieve fault-tolerance, describe their field experience to date, and mention some potential applications. The Pluribus system places the major responsibility for recovery from failures on the software. Failing hardware modules are removed from the system, spare modules are substituted where available, and appropriate initialization is performed. In applications where the goal is maximum availability rather than totally fault-free operation, this approach represents a considerable savings in complexity and cost over traditional implementations. The software-based reliability approach has been extended to provide enror-handling and recovery mechanisms for the system software structures as well. A number of Pluribus systems have been built and are currently in operation. Experience with these systems has given us confidence in their performance and maintainability, and leads us to suggest other applications that might benefit from this approach.