Academic Article

Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan

Document Type

Periodical

Author

Weintraub, G.; Gudes, E.; Dolev, S.; Ullman, J.D.

Source

IEEE Transactions on Cloud Computing (IEEE Trans. Cloud Comput.), 12(1):84-99, Jan. 2024

Subject

Computing and Processing
Communication, Networking and Broadcast Technologies
Big Data applications
Cloud computing
Costs
Measurement
Engines
Computer architecture
Standards
Cloud storage
data lakes
query optimization

ISSN

2168-7161
2372-0018

Abstract

Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on it in “on-demand” mode. However, to perform any computation in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation, which hurts performance and requires huge network bandwidth. In this paper, we study different approaches to improving query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. We then demonstrate experimentally that our approach is feasible and efficient (up to a 30× improvement in query execution time on the TPC-H benchmark).
