Academic Paper

GitFL: Uncertainty-Aware Real-Time Asynchronous Federated Learning Using Version Control
Document Type
Conference
Source
2023 IEEE Real-Time Systems Symposium (RTSS), IEEE, Dec. 2023, pp. 145-157
Subject
Aerospace
Communication, Networking and Broadcast Technologies
Computing and Processing
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Training
Performance evaluation
Adaptation models
Real-time systems
Time factors
Internet of Things
Load modeling
AIoT
Asynchronous Federated Learning
Uncertainty
Reinforcement Learning
Version Control
ISSN
2576-3172
Abstract
As a promising distributed machine learning paradigm that enables collaborative training without compromising data privacy, Federated Learning (FL) has been increasingly used in large-scale AIoT (Artificial Intelligence of Things) system design. However, due to the lack of efficient management of straggling devices, existing FL methods suffer greatly from long response times (e.g., training and communication latency) and low inference accuracy. Things become even worse when various uncertain factors in AIoT scenarios (e.g., network delays, performance variances caused by process variation) are taken into account. To address this issue, this paper proposes a novel asynchronous FL framework named GitFL, whose implementation is inspired by the famous version control system Git. Unlike traditional FL, the cloud server of GitFL maintains a master model (i.e., the global model) together with a set of branch models representing the trained local models committed by selected devices. The master model is updated based on all the pushed branch models and their version information, and only the branch models resulting from the pull operation are dispatched to devices. Using our proposed Reinforcement Learning (RL)-based device selection mechanism, a pulled branch model with an older version is more likely to be dispatched to a faster and less frequently selected device for the next round of local training. In this way, GitFL enables both effective control of model staleness and adaptive load balancing of versioned models among straggling devices, thus avoiding performance deterioration while ensuring real-time performance. Comprehensive experimental results on well-known models and datasets show that, compared with state-of-the-art asynchronous and synchronous FL methods, GitFL achieves up to 2.64× training acceleration and 7.88% inference accuracy improvement in various uncertain scenarios.
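The abstract's version-aware merge of branch models into the master model can be sketched as below. The linear weighting by version number, the flat parameter layout, and the function name `merge_master` are illustrative assumptions, not the paper's exact merge rule:

```python
def merge_master(branch_models, versions):
    """Merge pushed branch models into the master model.

    Each branch model is a dict mapping parameter names to flat lists of
    values; each entry of `versions` is that branch's version number.
    Newer (higher-version) branches receive proportionally more weight --
    a simple assumption standing in for GitFL's actual version-based rule.
    """
    total = float(sum(versions))
    weights = [v / total for v in versions]
    master = {}
    for name in branch_models[0]:
        # Version-weighted average of this parameter across all branches.
        master[name] = [
            sum(w * branch[name][i] for w, branch in zip(weights, branch_models))
            for i in range(len(branch_models[0][name]))
        ]
    return master

# Toy example: three branches at versions 1, 2, 3, each holding one
# hypothetical two-element parameter vector.
branches = [{"w": [v, v]} for v in (1.0, 2.0, 3.0)]
master = merge_master(branches, versions=[1, 2, 3])
# weights = [1/6, 2/6, 3/6], so each entry of master["w"] is 14/6 ≈ 2.333
```

A real deployment would operate on model tensors (e.g., a PyTorch `state_dict`) rather than Python lists, but the weighting logic is the same.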