Academic Paper

Efficient Distributed Range Query Processing in Apache Spark
Document Type
Conference
Source
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 569-575, May 2019
Subject
Computing and Processing
Sparks
Interpolation
Peer-to-peer computing
Cluster computing
Complexity theory
Query processing
Informatics
Range Queries
Apache
Spark
Indexing
Performance Evaluation
Abstract
Range queries are important in many diverse applications. In its simplest one-dimensional form, a range query is expressed by an interval [a, b] on the real line, and the answer consists of all elements that lie in [a, b]. In this work, we focus on efficient range query processing techniques in the Apache Spark engine, which is the state-of-the-art solution for big data management and analytics. We aim at developing a Spark-based indexing scheme that supports range queries in such large-scale decentralized environments and scales well w.r.t. the number of nodes and the data items stored. Towards this goal, solutions have been proposed in the last few years; however, they turn out to be inadequate at the envisaged scale, since the classic linear or even logarithmic complexity (for point queries) is still too expensive, and range query processing is even more demanding. In this paper, we go one step further and present a solution with sub-logarithmic complexity. In particular, we present SPIS (SPark-based Interpolation Search), a tree structure that outperforms the existing Spark built-in lookup techniques. We carry out an experimental evaluation using synthetic data sets. Our experimental results demonstrate the efficiency and scalability of the proposed approach.
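To make the one-dimensional range query and the interpolation-search idea concrete, here is a minimal single-machine sketch in Python. It is not the SPIS tree from the paper and does not use Spark. It only shows how interpolation search can locate the left endpoint of [a, b] in a sorted array, after which the answer is read off by a forward scan. The function names `interpolation_lower_bound` and `range_query` are hypothetical and chosen for illustration. On roughly uniformly distributed keys, interpolation search is known to need O(log log n) probes in expectation, which is the kind of sub-logarithmic behaviour the abstract refers to; on skewed keys this plain version can degrade to linear time.

```python
from typing import List, Sequence


def interpolation_lower_bound(keys: Sequence[float], target: float) -> int:
    """Return the first index i with keys[i] >= target (len(keys) if none).

    Assumes `keys` is sorted in non-decreasing order.
    """
    if not keys or target <= keys[0]:
        return 0
    if target > keys[-1]:
        return len(keys)

    lo, hi = 0, len(keys) - 1
    # Invariant: keys[lo] < target <= keys[hi], so keys[hi] - keys[lo] > 0.
    while hi - lo > 1:
        span = keys[hi] - keys[lo]
        # Guess the position by linear interpolation between the endpoints.
        pos = lo + int((target - keys[lo]) / span * (hi - lo))
        # Clamp strictly inside (lo, hi) so the interval always shrinks.
        pos = min(max(pos, lo + 1), hi - 1)
        if keys[pos] < target:
            lo = pos
        else:
            hi = pos
    return hi


def range_query(keys: Sequence[float], a: float, b: float) -> List[float]:
    """Return all elements of the sorted sequence `keys` that lie in [a, b]."""
    if a > b:
        return []
    start = interpolation_lower_bound(keys, a)
    end = start
    # Scan forward while elements are still inside the interval.
    while end < len(keys) and keys[end] <= b:
        end += 1
    return list(keys[start:end])
```

For example, with `keys = [2, 3, 5, 7, 11, 13, 17, 19]`, calling `range_query(keys, 5, 13)` returns `[5, 7, 11, 13]`. The forward scan costs time proportional to the answer size, so the total cost is the search cost plus the output size.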