Academic Paper

LightPool: A NVMe-oF-based High-performance and Lightweight Storage Pool Architecture for Cloud-Native Distributed Database
Document Type
Conference
Source
2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 983-995, Mar. 2024
Subject
Computing and Processing
Distributed databases
Computer architecture
Production
Bandwidth
Throughput
Resource management
Best practices
NVMe-oF
NVMe
High-performance Storage
Cloud-native Storage
Language
ISSN
2378-203X
Abstract
Emerging cloud-native distributed databases rely on local NVMe SSDs to provide high-performance and highly available data services to many cloud applications. However, these database clusters suffer from low utilization of local storage because of the imbalance between CPU and storage capacities within each node. For instance, the OceanBase distributed database cluster, with hundreds of PB of local storage capacity, utilizes only around 40% of its local storage. Although disaggregated storage (EBS) can enhance storage utilization by provisioning CPU and storage independently on demand, it suffers from performance bottlenecks and high costs. In this paper, we propose LightPool, a high-performance and lightweight storage pool architecture deployed at large scale in the OceanBase clusters to enhance storage resource utilization. The key idea of LightPool is to aggregate cluster storage into a storage pool and enable unified management. In particular, LightPool adopts NVMe-oF to enable high-performance storage resource sharing among cluster nodes and integrates the storage pool with Kubernetes to achieve flexible management and allocation of storage resources. Furthermore, we design hot-upgrade and hot-migration mechanisms to enhance the availability of LightPool. We have deployed LightPool on over 8500 nodes in production clusters. Statistics show that LightPool improves storage resource utilization from about 40% to 65%. Experimental results show that LightPool adds only about 2.1 μs of latency compared to local storage. Compared to OpenEBS, LightPool improves bandwidth by up to 190.9% in microbenchmarks and throughput by up to 6.9% in real-world applications. LightPool is a best practice for deploying NVMe-oF (NVMe/TCP) in production environments. We also discuss important lessons and experiences learned from the development of LightPool.
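
The utilization argument in the abstract can be illustrated with a toy placement model. The sketch below is not LightPool's allocator and does not model NVMe-oF or Kubernetes. The `Node` class, the function names, and all capacities are hypothetical, chosen only to show how a CPU/storage imbalance strands local disk and how pooling storage across nodes can recover it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: int      # free CPU cores (hypothetical units)
    disk_total: int    # local NVMe capacity in GB (hypothetical)
    disk_used: int = 0


def make_nodes():
    # Hypothetical cluster: node "a" is CPU-rich, node "b" is CPU-poor.
    return [Node("a", cpu_free=16, disk_total=1000),
            Node("b", cpu_free=4, disk_total=1000)]


def place_local(nodes, requests):
    """Local-only model: CPU and disk must come from the same node."""
    placed = 0
    for cpu, disk in requests:
        for n in nodes:
            if n.cpu_free >= cpu and n.disk_total - n.disk_used >= disk:
                n.cpu_free -= cpu
                n.disk_used += disk
                placed += 1
                break
    return placed


def place_pooled(nodes, requests):
    """Pooled model: CPU comes from one node, disk from anywhere in the pool."""
    placed = 0
    for cpu, disk in requests:
        host = next((n for n in nodes if n.cpu_free >= cpu), None)
        pool_free = sum(n.disk_total - n.disk_used for n in nodes)
        if host is None or pool_free < disk:
            continue
        host.cpu_free -= cpu
        remaining = disk
        # Use the host's own disk first, then spill over to remote nodes.
        for n in sorted(nodes, key=lambda n: n is not host):
            take = min(remaining, n.disk_total - n.disk_used)
            n.disk_used += take
            remaining -= take
            if remaining == 0:
                break
        placed += 1
    return placed


def utilization(nodes):
    """Fraction of total cluster disk capacity that is allocated."""
    return sum(n.disk_used for n in nodes) / sum(n.disk_total for n in nodes)


if __name__ == "__main__":
    requests = [(4, 400)] * 4  # four instances: 4 cores and 400 GB each
    for label, place in (("local", place_local), ("pooled", place_pooled)):
        nodes = make_nodes()
        count = place(nodes, requests)
        print(f"{label}: placed={count} utilization={utilization(nodes):.0%}")
```

In this made-up cluster, local-only placement fits three of the four instances and leaves disk stranded on both nodes, while pooled placement fits all four because the CPU-rich node can consume the CPU-poor node's unused disk. The gap between the two results depends entirely on the illustrative inputs and says nothing about the production figures reported in the abstract.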