Journal Article

High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
Document Type
Periodical
Source
IEEE Transactions on Parallel and Distributed Systems, 34(11):2946-2964, Nov. 2023
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Training
Processor scheduling
Fluids
Data models
Graphics processing units
Containers
Job shop scheduling
Cloud native
dataset abstraction
elastic data cache
job scheduling
Language
English
ISSN
1045-9219
1558-2183
2161-9883
Abstract
Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms, which actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, this setting also raises new challenges; our work focuses on those related to I/O throughput for training, including complex data access, I/O provisioning that fails to match dynamic I/O requirements, and inefficient I/O resource scheduling across different jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a high-level data abstraction called Fluid Dataset to access training data from heterogeneous sources with elastic data acceleration. In addition, it comes with an on-the-fly cache system autoscaler that matches the online training speed and adaptively increases the number of cache replicas to alleviate I/O bottlenecks. To improve the overall performance of multiple DL jobs, Fluid co-orchestrates the data cache and DL jobs by arranging job scheduling in an appropriate order, and can also place data caches and DL jobs on the same node to exploit cache affinity. Experimental results show significant performance improvements for each individual DL job that uses dynamic computing resources with Fluid. For scheduling multiple DL jobs that share datasets, Fluid achieves around 2x speedup when integrated with existing widely used and cutting-edge scheduling solutions through an appropriate job scheduling order. In addition, the cache affinity scheduling policy also significantly improves job execution performance. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF) with many production adopters.
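For context, the open-source Fluid project exposes the Fluid Dataset abstraction as Kubernetes custom resources: a Dataset declares where the training data lives, and a runtime resource (e.g., an AlluxioRuntime) provisions the elastic cache backing it. The sketch below is illustrative only; the bucket path, dataset name, replica count, and cache quota are assumed values, not ones taken from the paper.

```yaml
# Hypothetical Fluid Dataset: mounts a remote object-store prefix
# so that training pods can read it through the cache layer.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet            # assumed name for illustration
spec:
  mounts:
    - mountPoint: oss://example-bucket/imagenet   # assumed source path
      name: imagenet
---
# Hypothetical cache runtime for the Dataset above; the autoscaler
# described in the abstract can adjust the replica count at runtime.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet            # must match the Dataset name
spec:
  replicas: 2               # initial number of cache workers (assumed)
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi          # per-worker cache capacity (assumed)
```

Once both resources are bound, a training job can reference the dataset's volume and be scheduled with cache affinity so that compute lands near the cached replicas.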