Academic Paper

Optimizing Near-Data Processing for Spark

Document Type

Conference

Author

Rachuri, Sri Pramodh; Gantasala, Arun; Emanuel, Prajeeth; Gandhi, Anshul; Foley, Robert; Puhov, Peter; Gkountouvas, Theodoros; Lei, Hui

Source

2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pp. 636-646, Jul 2022

Subject

Communication, Networking and Broadcast Technologies
Computing and Processing
Industries
Analytical models
Simulation
Prototypes
Data transfer
Libraries
Sparks
resource disaggregation
near-data processing
spark
pushdown
modeling

Language

ISSN

2575-8411

Abstract

Resource disaggregation (RD) is an emerging paradigm for data center computing whereby resource-optimized servers are employed to minimize resource fragmentation and improve resource utilization. Apache Spark deployed under the RD paradigm employs a cluster of compute-optimized servers to run executors and a cluster of storage-optimized servers to host the data on HDFS. However, the network transfer from storage to compute cluster becomes a severe bottleneck for big data processing. Near-data processing (NDP) is a concept that aims to alleviate network load in such cases by offloading (or "pushing down") some of the compute tasks to the storage cluster. Employing NDP for Spark under the RD paradigm is challenging because storage-optimized servers have limited computational resources and cannot host the entire Spark processing stack. Further, even if such a lightweight stack could be developed and deployed on the storage cluster, it is not entirely obvious which Spark queries would benefit from pushdown, and which tasks of a given query should be pushed down to storage.

This paper presents the design and implementation of a near-data processing system for Spark, SparkNDP, that aims to address the aforementioned challenges. SparkNDP works by implementing novel NDP Spark capabilities on the storage cluster using a lightweight library of SQL operators and then developing an analytical model to help determine which Spark tasks should be pushed down to storage based on the current network and system state. Simulation and prototype implementation results show that SparkNDP can help reduce Spark query execution times when compared to both the default approach of not pushing down any tasks to storage and the outright NDP approach of pushing all tasks to storage.
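The pushdown trade-off described in the abstract can be illustrated with a toy cost comparison. This is only a sketch under stated assumptions and is not the analytical model from the paper: the names `ClusterState`, `Task`, and `should_push_down`, the per-task throughput parameters, and the two time formulas are all hypothetical. It assumes a task either ships its full input over the network and runs on a compute server, or runs on a (slower) storage server and ships only its reduced output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of network and system state (all rates in bytes/s)."""
    network_bytes_per_s: float   # available storage-to-compute bandwidth
    compute_bytes_per_s: float   # processing rate of a task on a compute server
    storage_bytes_per_s: float   # processing rate of a task on a storage server


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one Spark task that reads from storage."""
    name: str
    input_bytes: float
    selectivity: float           # output bytes divided by input bytes


def time_without_pushdown(task: Task, state: ClusterState) -> float:
    # Assumption: transfer the whole input, then process it on the compute cluster.
    transfer = task.input_bytes / state.network_bytes_per_s
    process = task.input_bytes / state.compute_bytes_per_s
    return transfer + process


def time_with_pushdown(task: Task, state: ClusterState) -> float:
    # Assumption: process on the storage cluster, then transfer only the output.
    process = task.input_bytes / state.storage_bytes_per_s
    transfer = task.input_bytes * task.selectivity / state.network_bytes_per_s
    return process + transfer


def should_push_down(task: Task, state: ClusterState) -> bool:
    """Push down only when the estimated time with pushdown is strictly lower."""
    return time_with_pushdown(task, state) < time_without_pushdown(task, state)


if __name__ == "__main__":
    state = ClusterState(
        network_bytes_per_s=1e9,
        compute_bytes_per_s=4e9,
        storage_bytes_per_s=1e9,
    )
    for task in (
        Task("selective_filter", input_bytes=10e9, selectivity=0.1),
        Task("wide_projection", input_bytes=10e9, selectivity=0.9),
    ):
        print(task.name, should_push_down(task, state))
```

Under these assumed numbers, the highly selective filter is worth pushing down because it removes most of the data before the network transfer, while the wide projection is not, since the slower storage-side processing outweighs the small reduction in transferred bytes. This mirrors the abstract's point that neither "push nothing" nor "push everything" is best for every task.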
