Academic Paper

CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems

Document Type

Conference

Author

Gurumdimma, Nentawe; Jhumka, Arshad; Liakata, Maria; Chuah, Edward; Browne, James

Source

2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS), pp. 51-60, Sep. 2016

Subject

Computing and Processing
Radiation detectors
Clustering algorithms
Entropy
Supercomputers
Mutual information
Detection algorithms
Resource management
anomaly detection
resource usage data
faults
detection
large-scale HPC systems
unsupervised
event logs

Language

ISSN

1060-9857

Abstract

The use of console logs for error detection in large-scale distributed systems has proven useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. To increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We then evaluate our approach using console logs and resource usage data from the Ranger supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, our algorithm provides an average improvement of around 85%, with a best-case improvement of 250%.
