Academic paper

CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems
Document Type
Conference
Source
2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS), pp. 51-60, Sep 2016
Subject
Computing and Processing
Radiation detectors
Clustering algorithms
Entropy
Supercomputers
Mutual information
Detection algorithms
Resource management
anomaly detection
resource usage data
faults
detection
large-scale HPC systems
unsupervised
event logs
Language
ISSN
1060-9857
Abstract
Console logs have proven useful to system administrators for error detection in large-scale distributed systems. However, such logs are typically redundant and incomplete, which makes accurate detection very difficult. To increase this accuracy, we complement the incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that uses both the resource usage data and the console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We evaluate our approach using console logs and resource usage data from the Ranger Supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, it provides an average improvement of around 85%, with a best-case improvement of 250%.
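The abstract does not say how contributions (ii) and (iii) are implemented, so the following is only an illustrative sketch and not the CRUDE algorithm itself. It assumes each job is a record with a hypothetical `id`, a single hypothetical usage metric `mem_gb`, and the list of `nodes` it ran on. It flags jobs whose metric deviates strongly from the median (a median/MAD robust z-score, swapped in here as a stand-in anomaly detector), then counts how many flagged jobs touched each node. The clustering step (i) is omitted.

```python
from statistics import median


def robust_z_scores(values):
    """Return median/MAD-based z-scores for a list of numbers."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations are (nearly) identical: nothing stands out.
        return [0.0 for _ in values]
    # 0.6745 rescales MAD so scores are comparable to standard z-scores.
    return [0.6745 * (v - med) / mad for v in values]


def anomalous_jobs(jobs, metric, threshold=3.5):
    """Return IDs of jobs whose metric lies far from the median (assumed rule)."""
    scores = robust_z_scores([job[metric] for job in jobs])
    return [job["id"] for job, z in zip(jobs, scores) if abs(z) > threshold]


def suspect_nodes(jobs, flagged_ids):
    """Count, per node, how many flagged jobs ran on it."""
    flagged = set(flagged_ids)
    counts = {}
    for job in jobs:
        if job["id"] in flagged:
            for node in job["nodes"]:
                counts[node] = counts.get(node, 0) + 1
    return counts


# Hypothetical resource usage records, invented for illustration only.
jobs = [
    {"id": "j1", "mem_gb": 10, "nodes": ["n1"]},
    {"id": "j2", "mem_gb": 11, "nodes": ["n1", "n2"]},
    {"id": "j3", "mem_gb": 9, "nodes": ["n2"]},
    {"id": "j4", "mem_gb": 10, "nodes": ["n3"]},
    {"id": "j5", "mem_gb": 12, "nodes": ["n4"]},
    {"id": "j6", "mem_gb": 50, "nodes": ["n3", "n4"]},
]

flagged = anomalous_jobs(jobs, "mem_gb")
print(flagged)
print(suspect_nodes(jobs, flagged))
```

In a fuller pipeline, the per-node counts would be cross-checked against the console log entries for the same nodes and time window, which is the combination of data sources the abstract describes.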