Journal Article

Online Training Flow Scheduling for Geo-Distributed Machine Learning Jobs Over Heterogeneous and Dynamic Networks
Document Type
Periodical
Source
IEEE Transactions on Cognitive Communications and Networking, 10(1):277-291, Feb. 2024
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Training
Wide area networks
Bandwidth
Resource management
Machine learning
Synchronization
Data models
Geo-distributed machine learning
training jobs
resource allocation
online scheduling
Language
English
ISSN
2332-7731
2372-2045
Abstract
Geo-Distributed Machine Learning (Geo-DML) is a promising technology that performs collaborative learning across geographically dispersed data centers (DCs) in a privacy-preserving manner over Wide Area Networks (WANs). Unfortunately, the limited and heterogeneous WAN bandwidth poses significant challenges to the performance of Geo-DML systems, increasing communication overhead and ultimately reducing the revenue of ISPs. In particular, when multiple online jobs coexist in a Geo-DML system, competition for bandwidth between the training flows of different jobs aggravates this negative impact. To alleviate it, this paper investigates the problem of online training flow scheduling for Geo-DML jobs. We first formulate the studied problem as a Linear Programming (LP) model with the objective of maximizing ISP revenue. Then, we propose an online traffic scheduling algorithm called Training Flow Adaptive Steering (TFAS), which exploits a primal-dual framework tailored for efficient resource allocation across jobs to schedule training flows, such that system resources are maximally utilized and training procedures can be expedited and completed in a timely manner. Meanwhile, we conduct rigorous theoretical analysis to guarantee that the proposed algorithm achieves a good competitive ratio. Extensive evaluation results demonstrate that our algorithm performs well, outperforming commonly adopted solutions by 36.2%-49.4% on average.
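The abstract describes an online primal-dual approach: training flows arrive one by one, and the scheduler must decide admissions without knowledge of future jobs. The sketch below is not the paper's TFAS algorithm; it is a minimal, generic online primal-dual admission rule for flows sharing capacitated WAN links, shown only to illustrate the framework the abstract refers to. All names (`make_scheduler`, the exponential link-price rule, the revenue threshold) are illustrative assumptions, not from the paper.

```python
import math

def make_scheduler(capacity):
    """Generic online primal-dual admission control for flows on WAN links.

    capacity: dict mapping link id -> bandwidth capacity.
    A dual 'price' per link grows exponentially with its utilization
    (a standard rule in online packing LPs), so congested links
    discourage further admissions. Illustrative only, not TFAS.
    """
    used = {link: 0.0 for link in capacity}

    def price(link):
        # Price is ~0 on an idle link and approaches 1 near saturation.
        u = used[link] / capacity[link]
        return (math.e ** u - 1) / (math.e - 1)

    def admit(path, demand, revenue):
        # Primal step: accept the flow only if its revenue exceeds the
        # congestion cost implied by current dual prices along its path,
        # and the residual capacity can actually carry it.
        cost = sum(demand * price(link) for link in path)
        if revenue > cost and all(used[l] + demand <= capacity[l] for l in path):
            for link in path:
                used[link] += demand  # dual step: prices rise implicitly
            return True
        return False

    return admit
```

Admissions early on are cheap; as links fill, only high-revenue flows clear the price threshold, which is what yields the competitive-ratio guarantees typical of this framework.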