Journal Article

Online Training Flow Scheduling for Geo-Distributed Machine Learning Jobs Over Heterogeneous and Dynamic Networks
Document Type
Periodical
Source
IEEE Transactions on Cognitive Communications and Networking, 10(1):277-291, Feb. 2024
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Training
Wide area networks
Bandwidth
Resource management
Machine learning
Synchronization
Data models
Geo-distributed machine learning
training jobs
resource allocation
online scheduling
Language
English
ISSN
2332-7731
2372-2045
Abstract
Geo-Distributed Machine Learning (Geo-DML) is a promising technology that performs collaborative learning across geographically dispersed data centers (DCs) in a privacy-preserving manner over Wide Area Networks (WANs). Unfortunately, the limited and heterogeneous WAN bandwidth poses significant challenges to the performance of Geo-DML systems, increasing communication overhead and ultimately reducing the revenue of ISPs. In particular, when multiple online jobs coexist in a Geo-DML system, competition for bandwidth between the training flows of different jobs aggravates this negative impact. To alleviate it, this paper investigates the problem of online training flow scheduling for Geo-DML jobs. We first formulate the studied problem as a Linear Programming (LP) model with the objective of maximizing ISP revenue. Then, we propose an online traffic scheduling algorithm called Training Flow Adaptive Steering (TFAS), which exploits a primal-dual framework tailored for efficient resource allocation across jobs to schedule training flows, such that system resources are maximally utilized and training procedures can be expedited and completed in a timely manner. Meanwhile, we conduct rigorous theoretical analysis to guarantee that the proposed algorithm achieves a good competitive ratio. Extensive evaluation results demonstrate that our algorithm performs well, outperforming commonly adopted solutions by 36.2%-49.4% on average.
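The abstract describes an online primal-dual approach: training flows arrive one by one, and the scheduler must decide admissions without knowledge of future jobs. The sketch below is not the paper's TFAS algorithm; it is a minimal, generic online primal-dual admission rule for flows sharing capacitated WAN links, shown only to illustrate the framework the abstract refers to. All names (`make_scheduler`, the exponential link-price rule, the revenue threshold) are illustrative assumptions, not from the paper.

```python
import math

def make_scheduler(capacity):
    """Generic online primal-dual admission control for flows on WAN links.

    capacity: dict mapping link id -> bandwidth capacity.
    A dual 'price' per link grows exponentially with its utilization
    (a standard rule in online packing LPs), so congested links
    discourage further admissions. Illustrative only, not TFAS.
    """
    used = {link: 0.0 for link in capacity}

    def price(link):
        # Price is ~0 on an idle link and approaches 1 near saturation.
        u = used[link] / capacity[link]
        return (math.e ** u - 1) / (math.e - 1)

    def admit(path, demand, revenue):
        # Primal step: accept the flow only if its revenue exceeds the
        # congestion cost implied by current dual prices along its path,
        # and the residual capacity can actually carry it.
        cost = sum(demand * price(link) for link in path)
        if revenue > cost and all(used[l] + demand <= capacity[l] for l in path):
            for link in path:
                used[link] += demand  # dual step: prices rise implicitly
            return True
        return False

    return admit
```

Admissions early on are cheap; as links fill, only high-revenue flows clear the price threshold, which is what yields the competitive-ratio guarantees typical of this framework.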