Academic Paper

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning
Document Type
Conference
Source
2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), pp. 361-371, Jul. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Deep learning
Training
Tensors
Quantization (signal)
Stochastic processes
Ethernet
Distributed computing
Distributed Deep Learning
Gradient Compression
Power-SGD
System Optimization
Language
English
ISSN
2575-8411
Abstract
To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank approximation with Power-SGD) on a 32-GPU cluster. The results show that they do not always outperform well-optimized S-SGD, and can even perform worse, because they are incompatible with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) used in S-SGD. To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also benefits from the three system optimizations, just as S-SGD does. Compared with Power-SGD, the optimized ACP-SGD substantially reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06× over S-SGD and 1.43× over Power-SGD, and it consistently outperforms other baselines across different setups (from 8 to 64 GPUs and from 1 Gb/s Ethernet to 100 Gb/s InfiniBand).
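
The following is a minimal sketch, not the authors' implementation, of the three compression primitives named in the abstract: sign quantization (Sign-SGD), top-k sparsification (Top-k SGD), and a single Power-SGD-style power iteration for low-rank approximation. It uses PyTorch; the function names and the density/rank parameters are illustrative assumptions.

import torch

def sign_compress(grad: torch.Tensor) -> torch.Tensor:
    # Quantization (Sign-SGD): transmit only the sign of each gradient element.
    return torch.sign(grad)

def topk_compress(grad: torch.Tensor, density: float = 0.01):
    # Sparsification (Top-k SGD): keep the k largest-magnitude elements
    # and transmit them as (values, indices).
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def powersgd_compress(grad: torch.Tensor, q: torch.Tensor):
    # Low-rank (Power-SGD style): one power iteration approximating the
    # gradient matrix M with P @ Q^T, where P is (m, r) and Q is (n, r).
    m = grad.reshape(grad.shape[0], -1)   # view the gradient as a 2-D matrix M
    p = m @ q                             # P = M Q
    p, _ = torch.linalg.qr(p)             # orthogonalize P
    q_new = m.t() @ p                     # Q = M^T P
    return p, q_new, (p @ q_new.t()).reshape(grad.shape)

In Power-SGD, the low-rank factors P and Q are communicated instead of the full gradient; per the abstract, ACP-SGD alternately compresses and communicates these low-rank matrices, which is what restores compatibility with all-reduce, pipelining, and tensor fusion as in S-SGD.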