Academic Paper

Coordinated Analysis of Heterogeneous Monitor Data in Enterprise Clouds for Incident Response
Document Type
Conference
Source
2019 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp. 53-58, Oct 2019
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Power, Energy and Industry Applications
Transportation
cloud computing
log analysis
reliability
incident response
log clustering
AIOps
Language
Abstract
During incident analysis and response, enterprise cloud administrators want to use as much of their generated monitor data as possible. However, the reality is that decisions are often dictated by the tools actually available to automatically process the monitor data, rather than by an understanding of the relevance of the data for incident response. The significant manual effort and domain expertise required to process diverse cloud monitors means that much monitor data remain unexamined. We propose a framework for simplifying the complexity of data analysis for incident response. Our framework enables coordinated analysis of both metric (numerical) data and log (semi-structured, textual) data and exposes salient features within those data. As a foundation for the framework, we define a taxonomy for fields within monitor data based on insights gained from analyzing logs and metrics collected from all levels of an experimental platform-as-a-service (PaaS) cloud (EPC). Using the taxonomy, we lay out a method for semi-automated feature extraction and discovery across heterogeneous monitors. We then describe a method for feature clustering to promote effective analysis of the data, and to remove redundant and uninformative features. We discuss the application of our framework for incident response within the EPC, including root cause analysis.
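The abstract does not specify how features are clustered, so the following is only a minimal sketch of one common way to group redundant features and discard uninformative ones. It assumes that "uninformative" means a constant series and that "redundant" means a high absolute Pearson correlation with an existing cluster's first member. The function names, the 0.95 threshold, and the sample feature names are all hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cluster_features(features, threshold=0.95):
    """Group feature names whose series are strongly correlated.

    Assumption (not from the paper): constant series are treated as
    uninformative and dropped; a feature joins the first cluster whose
    representative (first member) it correlates with at |r| >= threshold.
    """
    informative = {n: v for n, v in features.items() if pvariance(v) > 0}
    clusters = []
    for name, values in informative.items():
        for cluster in clusters:
            representative = informative[cluster[0]]
            if abs(pearson(values, representative)) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([name])
    return clusters


# Hypothetical features extracted from metrics and logs.
sample = {
    "cpu_util": [1, 2, 3, 4],
    "cpu_load": [2, 4, 6, 8],     # moves in lockstep with cpu_util
    "disk_errors": [0, 0, 0, 0],  # constant, carries no signal
    "net_drops": [5, 1, 4, 2],
}
clusters = cluster_features(sample)
```

Keeping one representative per cluster would then give an analyst a smaller feature set to inspect during incident response. A greedy single pass like this depends on feature order, so a real system would more likely use hierarchical clustering over the full correlation matrix.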