Journal article

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster
Document Type
Periodical
Source
IEEE Transactions on Parallel and Distributed Systems, 34(9):2553-2567, Sep. 2023
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Training
Task analysis
Graphics processing units
Resource management
Deep learning
Costs
Load modeling
Deep learning system
distributed training
elastic deep learning
GPU cluster scheduling
Language
English
ISSN
1045-9219
1558-2183
2161-9883
Abstract
Deep learning tasks (DLTs) comprise training and inference tasks: training DLTs aim to minimize average job completion time (JCT), while inference DLTs require sufficient GPUs to meet real-time performance targets. Unfortunately, existing work deploys separate multi-tenant GPU clusters for training and inference, so training DLTs suffer high JCT under limited GPUs even while the inference cluster sits underutilized due to its periodic workload. DeepBoot addresses this challenge by utilizing idle GPUs in the inference cluster for training DLTs. Specifically, 1) DeepBoot designs an adaptive task scaling (ATS) algorithm that allocates GPUs in both the training and inference clusters to training DLTs and minimizes the performance loss when inference GPUs are reclaimed, and 2) DeepBoot implements auto-fast elastic (AFE) training based on Pollux to reduce the restart overhead caused by inference-GPU reclaiming. Our implementation on a testbed and a large-scale simulation on Microsoft deep learning workloads show that DeepBoot achieves 32% and 38% average JCT reduction, respectively, compared with a scheduler that does not utilize idle GPUs in the inference cluster.
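To make the core idea in the abstract concrete, here is a minimal toy sketch of a scheduler that lends idle inference-cluster GPUs to training jobs and reclaims them when inference demand rises. This is not the paper's ATS or AFE algorithm; the class, method names, and accounting logic are invented purely for illustration.

```python
class ToyScheduler:
    """Toy model of borrowing idle inference GPUs for training.

    Hypothetical illustration only -- not DeepBoot's actual ATS algorithm.
    """

    def __init__(self, train_gpus, infer_gpus):
        self.train_gpus = train_gpus   # dedicated training-cluster GPUs
        self.infer_gpus = infer_gpus   # total inference-cluster GPUs
        self.borrowed = 0              # inference GPUs currently lent to training

    def available_for_training(self, infer_demand):
        """GPUs training jobs may use, given current inference demand."""
        idle = max(self.infer_gpus - infer_demand, 0)
        self.borrowed = idle
        return self.train_gpus + self.borrowed

    def reclaim(self, infer_demand):
        """Return borrowed GPUs when inference demand rises.

        Elastic training jobs would then shrink instead of restarting
        from scratch (the role AFE plays in the abstract).
        """
        serving = self.infer_gpus - self.borrowed  # GPUs still on inference
        needed = max(infer_demand - serving, 0)
        released = min(self.borrowed, needed)
        self.borrowed -= released
        return released


# Example: 8 training GPUs, 8 inference GPUs.
sched = ToyScheduler(train_gpus=8, infer_gpus=8)
print(sched.available_for_training(infer_demand=2))  # 6 idle GPUs borrowed -> 14
print(sched.reclaim(infer_demand=7))                 # demand rises -> 5 GPUs returned
```

The sketch captures only the borrow/reclaim accounting; the paper's contribution lies in deciding how much to lend (ATS) and how to resize training jobs cheaply when GPUs are taken back (AFE).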