Academic Paper

Performance Prediction for Data-driven Workflows on Apache Spark
Document Type
Conference
Source
2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 1-8, Nov. 2020
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Analytical models
Computational modeling
Predictive models
Data models
Sparks
Task analysis
Complex systems
performance prediction
workflow applications
Spark
machine learning
Language
ISSN
2375-0227
Abstract
Spark is an in-memory framework for implementing distributed applications of various types. Predicting the execution time of Spark applications is an important but challenging problem that several studies have tackled in the past few years, most of them achieving good prediction accuracy on simple applications (e.g., known ML algorithms or SQL-based applications). In this work, we consider complex data-driven workflow applications, in which the execution and data flow can be modeled by Directed Acyclic Graphs (DAGs). Workflows can be made of an arbitrary combination of known tasks, each applying a set of Spark operations to its input data. By adopting a hybrid approach that combines analytical and machine learning (ML) models trained on small DAGs, we can predict, with good accuracy, the execution time of unseen workflows of higher complexity and size. We validate our approach through extensive experiments on real-world complex applications, comparing different ML models and choices of feature sets.
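The abstract does not say how the analytical and ML models are combined, so the following is only an illustrative sketch of one possible reading, not the authors' method. It assumes that each task's duration has already been predicted (the fixed numbers stand in for the output of a trained ML model) and that the analytical part is a critical-path computation over the workflow DAG, with unlimited parallelism and no resource contention. The function name `predict_workflow_time`, the task names, and all durations are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def predict_workflow_time(dag, task_time):
    """Estimate workflow duration as the longest path through the DAG.

    dag maps each task to the set of tasks it depends on.
    task_time maps each task to its predicted duration in seconds
    (assumed here to come from a separately trained per-task model).
    """
    finish = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(dag).static_order():
        start = max((finish[dep] for dep in dag.get(task, ())), default=0.0)
        finish[task] = start + task_time[task]
    return max(finish.values(), default=0.0)


# Hypothetical five-task workflow; durations are placeholders, not measurements.
dag = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "stats": {"clean"},
    "report": {"features", "stats"},
}
task_time = {"load": 4.0, "clean": 6.0, "features": 10.0, "stats": 3.0, "report": 2.0}

total = predict_workflow_time(dag, task_time)
print(total)  # 22.0, along load -> clean -> features -> report
```

In this sketch, the per-task estimates are the only learned component, so a model trained on small DAGs can be reused on a larger unseen DAG as long as the larger workflow is built from the same known tasks. A real Spark cluster shares executors among concurrent tasks, which this simplification ignores.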