e-Article

Root Cause Analysis for Cloud-Native Applications
Document Type
Periodical
Source
IEEE Transactions on Cloud Computing (IEEE Trans. Cloud Comput.), 12(1):232-250, Jan. 2024
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Cloud computing
Correlation
Semantics
Behavioral sciences
Trajectory
Measurement
Inference algorithms
Root cause analysis
event correlation
knowledge mining
cloud-native applications
Language
ISSN
2168-7161
2372-0018
Abstract
Root cause analysis (RCA) is critical to maintaining the reliability and performance of modern cloud applications. However, because of the inherent complexity of cloud environments, traditional RCA techniques fall short in supporting system administrators during daily incident response. This article presents an RCA solution designed specifically for cloud applications, capable of pinpointing failure root causes and reconstructing complete fault trajectories from root cause to effect. The novelty of our approach lies in approximating causal dependencies between symptoms by combining several symptom correlation methods that assess symptoms along structural, semantic, and temporal dimensions. The solution integrates statistical methods with mining of system structure and behavior, offering a more comprehensive analysis than existing techniques. Building on these concepts, we provide definitions and construction algorithms for the RCA model structures used in inference, propose a symptom correlation framework covering the essential elements of symptom data analysis, and describe the resulting root cause identification process in detail. Functional evaluation on a live microservice-based system demonstrates the effectiveness of our approach in identifying root causes of complex failures across multiple cloud layers.
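The general idea of combining structural, semantic, and temporal symptom correlation to approximate causal links can be illustrated with a small sketch. This is not the algorithm from the article, whose model structures and scoring are not given in the abstract. Everything here is an assumption made for illustration: the `Symptom` fields, the `DEPENDENTS` topology, the `SEMANTIC_RULES` table, the weights, the 60-second window, the 0.6 threshold, and the greedy trajectory walk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    name: str
    component: str    # where the symptom was observed
    kind: str         # coarse semantic label (hypothetical taxonomy)
    timestamp: float  # seconds since the start of the incident


# Hypothetical topology: component -> components that depend on it.
DEPENDENTS = {"node-1": {"db"}, "db": {"orders-svc"}, "orders-svc": {"gateway"}}

# Hypothetical semantic rules: cause kind -> effect kinds it may plausibly produce.
SEMANTIC_RULES = {
    "resource": {"latency", "error"},
    "latency": {"latency", "error"},
    "error": {"error"},
}


def structural(a: Symptom, b: Symptom) -> float:
    """1.0 if b sits on the same component as a or on a direct dependent of it."""
    related = a.component == b.component or b.component in DEPENDENTS.get(a.component, set())
    return 1.0 if related else 0.0


def semantic(a: Symptom, b: Symptom) -> float:
    """1.0 if the rule table allows a's kind to cause b's kind."""
    return 1.0 if b.kind in SEMANTIC_RULES.get(a.kind, set()) else 0.0


def temporal(a: Symptom, b: Symptom, window: float = 60.0) -> float:
    """Decays linearly with the delay; 0.0 if b does not follow a within the window."""
    delta = b.timestamp - a.timestamp
    return 1.0 - delta / window if 0 < delta <= window else 0.0


def correlation(a: Symptom, b: Symptom, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three aspects; a cause must precede its effect."""
    t = temporal(a, b)
    if t == 0.0:
        return 0.0
    return weights[0] * structural(a, b) + weights[1] * semantic(a, b) + weights[2] * t


def build_graph(symptoms, threshold: float = 0.6):
    """Directed edges cause -> effect for every pair scoring at or above the threshold."""
    edges = {s.name: {} for s in symptoms}
    for a in symptoms:
        for b in symptoms:
            if a is not b:
                score = correlation(a, b)
                if score >= threshold:
                    edges[a.name][b.name] = score
    return edges


def root_causes(edges):
    """Symptoms that explain others but are explained by none."""
    has_incoming = {dst for targets in edges.values() for dst in targets}
    return [src for src, targets in edges.items() if targets and src not in has_incoming]


def trajectory(edges, root):
    """Greedy walk along the strongest edge; terminates because edges follow time order."""
    path = [root]
    while edges[path[-1]]:
        path.append(max(edges[path[-1]], key=edges[path[-1]].get))
    return path


# Made-up incident data, for illustration only.
SYMPTOMS = [
    Symptom("disk_full", "node-1", "resource", 0.0),
    Symptom("slow_queries", "db", "latency", 10.0),
    Symptom("http_500", "orders-svc", "error", 20.0),
    Symptom("gateway_errors", "gateway", "error", 25.0),
]

if __name__ == "__main__":
    graph = build_graph(SYMPTOMS)
    for root in root_causes(graph):
        print(" -> ".join(trajectory(graph, root)))
```

With the made-up data above, the only symptom without an incoming edge is `disk_full`, and the walk from it passes through `slow_queries` and `http_500` to `gateway_errors`. A weighted sum is only one way to merge the three aspects; the article's framework may combine them quite differently.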