Academic Paper

A Load-Balancing Strategy Based on Multi-Task Learning in a Distributed Training Environment
Document Type
Conference
Source
2023 International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), pp. 862-868, Aug. 2023
Subject
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Training
Computational modeling
Pressing
Predictive models
Parallel processing
Multitasking
Data models
Load balancing
Distributed training
Parallel computing model
Multi-task learning
MMoE
Language
English
Abstract
With the development of machine learning and big data technologies, distributed training has become an important way to improve computational efficiency. However, in a distributed training environment, performance differences between workers and interference from unrelated tasks can lead to uneven system load and cause the "straggler" phenomenon. How to mitigate stragglers and achieve system-wide load balancing is therefore a pressing problem in distributed training. To address the shortcomings of the existing parallel computing model DSSP, this paper proposes a load-balancing strategy for DDP that effectively reduces load differences by adjusting the batch size and data size of each worker during training. Then, to explore the correlation between the data size in DDP and the synchronization threshold in DSSP under dynamic adjustment, we apply multi-task learning to the two dynamic adjustment strategies and integrate the proposed Joint Multi-Task Prediction scheme into DSSP to implement a new parallel computing model, ESP. Extensive experimental results show that ESP not only guarantees model accuracy but also effectively improves training speed in distributed training.
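To make the batch-size adjustment idea concrete, the following is a minimal sketch of one way a load-balancing step could redistribute the global batch in proportion to each worker's measured throughput, so that stragglers receive less data per iteration. The function name, inputs, and proportional-allocation rule are illustrative assumptions, not the paper's actual algorithm.

# Minimal sketch of per-worker batch-size rebalancing based on measured
# iteration time. All names (rebalance_batch_sizes, iter_times) are
# illustrative assumptions, not the paper's implementation.

def rebalance_batch_sizes(batch_sizes, iter_times, total_batch=None):
    """Shift work toward faster workers so per-iteration times converge.

    batch_sizes: current batch size of each worker
    iter_times:  measured wall-clock time of each worker's last iteration
    """
    if total_batch is None:
        total_batch = sum(batch_sizes)
    # Estimate per-worker throughput (samples per second).
    throughput = [b / t for b, t in zip(batch_sizes, iter_times)]
    total_throughput = sum(throughput)
    # Assign each worker a share of the global batch proportional to its
    # throughput, so slower workers (stragglers) receive less data.
    return [max(1, round(total_batch * s / total_throughput))
            for s in throughput]

# Example: worker 2 is a 3x straggler, so its batch shrinks.
print(rebalance_batch_sizes([64, 64, 64], [1.0, 1.0, 3.0]))
# -> roughly [82, 82, 27]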
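The record's keyword list mentions MMoE, suggesting the Joint Multi-Task Prediction scheme is built on a Multi-gate Mixture-of-Experts network. Below is a hedged PyTorch sketch of an MMoE with two task towers, one per dynamically adjusted quantity (data size in DDP and synchronization threshold in DSSP); the input features, layer sizes, and expert count are assumptions for illustration, not the paper's reported architecture.

# Hedged MMoE sketch: shared experts, one softmax gate per task, and one
# tower per prediction target. Architecture details are assumptions.
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim=8, expert_dim=32, n_experts=4, n_tasks=2):
        super().__init__()
        # Shared experts: each maps runtime features to a hidden vector.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
            for _ in range(n_experts))
        # One gate per task mixes the experts differently per task.
        self.gates = nn.ModuleList(
            nn.Linear(in_dim, n_experts) for _ in range(n_tasks))
        # Task towers: task 0 -> data size, task 1 -> sync threshold.
        self.towers = nn.ModuleList(
            nn.Linear(expert_dim, 1) for _ in range(n_tasks))

    def forward(self, x):
        # expert_out: (batch, n_experts, expert_dim)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)
            mixed = (w * expert_out).sum(dim=1)  # task-specific mixture
            outputs.append(tower(mixed))
        return outputs  # [data_size_pred, sync_threshold_pred]

# Example: input features might be per-worker iteration time, queue
# length, gradient staleness, etc. (assumed here).
model = MMoE()
data_size, threshold = model(torch.randn(16, 8))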