Academic Paper

On Performance Modeling and Prediction for Spark-HBase Applications in Big Data Systems
Document Type
Conference
Source
ICC 2022 - IEEE International Conference on Communications, pp. 3685-3690, May 2022
Subject
Communication, Networking and Broadcast Technologies
Machine learning algorithms
Machine learning
Predictive models
Big Data
Parallel processing
Prediction algorithms
Data models
Spark
HBase
big data
machine learning
representation learning
performance modeling and prediction
Language
ISSN
1938-1883
Abstract
Many large-scale applications in business and scientific domains require both parallel computing and distributed data management for big data processing. One typical scenario is using the Spark computing engine to process large amounts of data managed by HBase in Hadoop. Such computing workflows offer an opportunity to optimize application performance through strategic resource allocation and suitable parameter settings. Recommending optimal system configurations to end users therefore requires accurate modeling and prediction of application performance. This is a challenging problem for multiple reasons, chiefly the large parameter space and the dynamic interactions between the technology layers of big data systems. In this paper, we propose a class of regression-based machine learning models to predict the execution performance of Spark-HBase applications in Hadoop. We first explore and identify an exhaustive set of system parameters across multiple layers, including Spark and HBase, and then conduct an in-depth exploratory analysis of their effects on the execution time of Spark-HBase applications. Based on the results of this analysis, we design a performance predictor using regression-based machine learning algorithms. Experimental results show that the resulting predictor achieves high accuracy across the different algorithms compared. The proposed approach can facilitate automatic system configuration and has the potential to be applied to other similar systems for big data processing.
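The general idea of a regression-based performance predictor can be illustrated with a minimal sketch. It is not the authors' method: the three features, the synthetic runtime formula, and the plain least-squares model are all assumptions chosen for illustration, and the training data is generated rather than measured. The property names are real Spark and HBase settings, but nothing here indicates they are the parameters studied in the paper.

```python
import random

# Hypothetical feature set (an assumption, not the paper's parameter list).
FEATURES = ["spark.executor.cores", "spark.executor.memory (GB)",
            "hbase.client.scanner.caching"]

def synthetic_runtime(cfg, rng):
    """Made-up ground truth in seconds, used only to generate toy data."""
    cores, mem_gb, caching = cfg
    return 300.0 - 12.0 * cores - 4.0 * mem_gb - 0.05 * caching + rng.gauss(0, 2.0)

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, cfg):
    """Predicted execution time for one configuration tuple."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], cfg))

# Generate synthetic "profiling runs" and fit the predictor.
rng = random.Random(0)
configs = [(rng.randint(1, 8), rng.choice([2, 4, 8, 16]),
            rng.choice([100, 500, 1000])) for _ in range(200)]
times = [synthetic_runtime(c, rng) for c in configs]
weights = fit_least_squares(configs, times)

# Recommend the candidate configuration with the lowest predicted runtime.
candidates = [(2, 4, 100), (4, 8, 500), (8, 16, 1000)]
best = min(candidates, key=lambda c: predict(weights, c))
```

In practice the training pairs would come from measured runs of a real Spark-HBase job, and a linear model may be too simple to capture interactions between layers; the sketch only shows the fit-then-recommend workflow the abstract describes.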