Academic Paper

Interactive Analytic DBMSs: Breaching the Scalability Wall
Document Type
Conference
Source
2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 432-443, Apr. 2021
Subject
Computing and Processing
Measurement
Fault tolerance
Social networking (online)
Scalability
Fault tolerant systems
Production
Load management
Interactive DBMS
analytics
sharding
Language
English
ISSN
2375-026X
Abstract
Analytic DBMSs optimized for query interactivity commonly push the computation down to storage nodes, thus avoiding large network transfers and keeping query execution wall-time to a minimum. In these systems, data is sharded and stored locally by cluster nodes, which must all participate in query execution. As the system scales out, hardware failures and other non-deterministic sources of tail latency start to dominate, to a point where query latency and success ratio increasingly violate the system's SLA. We refer to this tipping point as the system's scalability wall, beyond which sharding data across more nodes only worsens the problem.

This paper describes how an analytic DBMS optimized for low-latency queries can breach the scalability wall by sharding different tables to different subsets of cluster nodes — a strategy we call partial sharding — thereby reducing query fan-out. Because partial sharding requires the DBMS to implement many tedious and complex shard management tasks, such as shard mapping, load balancing and fault tolerance, this paper describes how a database system can leverage an external general-purpose shard management service for such tasks. We present a case study based on Cubrick, an in-memory analytic DBMS developed at Facebook, highlighting the integration points with a shard management framework called Shard Manager. Finally, we describe the many design decisions, pitfalls and lessons learned during this process, which eventually allowed Cubrick to scale to thousands of nodes.
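The core idea of partial sharding — mapping each table's shards to a small, stable subset of cluster nodes so that query fan-out is bounded by the subset size rather than the cluster size — can be illustrated with a minimal sketch. This is not the paper's or Shard Manager's actual placement algorithm; the function name, the use of rendezvous-style hashing to pick a deterministic node subset, and the round-robin shard assignment are all illustrative assumptions.

```python
import hashlib

def place_table_shards(table, num_shards, cluster, subset_size):
    """Hypothetical partial-sharding scheme: assign a table's shards
    to a deterministic subset of cluster nodes."""
    # Rank nodes by a hash of (table, node) and keep the top subset_size.
    # This rendezvous-style hashing yields a stable per-table subset
    # without any central coordination state.
    ranked = sorted(
        cluster,
        key=lambda node: hashlib.sha256(f"{table}:{node}".encode()).hexdigest(),
    )
    subset = ranked[:subset_size]
    # Spread the table's shards round-robin over its node subset.
    return {shard: subset[shard % subset_size] for shard in range(num_shards)}

cluster = [f"node{i}" for i in range(100)]
placement = place_table_shards("clicks", num_shards=32,
                               cluster=cluster, subset_size=8)
# A query over this table fans out to at most subset_size nodes,
# not to all 100 — the property partial sharding relies on.
fanout = len(set(placement.values()))
assert fanout <= 8
```

Under full sharding the same query would touch all 100 nodes, so every node's tail latency and failure probability contributes to the query; bounding fan-out at 8 is what pushes back the scalability wall.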