Academic Paper

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning
Document Type
Conference
Source
2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), pp. 361-371, Jul. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Deep learning
Training
Tensors
Quantization (signal)
Stochastic processes
Ethernet
Distributed computing
Distributed Deep Learning
Gradient Compression
Power-SGD
System Optimization
Language
English
ISSN
2575-8411
Abstract
To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank approximation with Power-SGD) on a 32-GPU cluster. The results show that they do not always outperform well-optimized S-SGD, and can even perform worse, because they are incompatible with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) used in S-SGD. To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also benefits from the three system optimizations, just as S-SGD does. Compared with Power-SGD, the optimized ACP-SGD substantially reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06× over S-SGD and 1.43× over Power-SGD, and it consistently outperforms other baselines across different setups (from 8 to 64 GPUs and from 1 Gb/s Ethernet to 100 Gb/s InfiniBand).
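
The following is a minimal sketch, not the authors' implementation, of the three compression primitives named in the abstract: sign quantization (Sign-SGD), top-k sparsification (Top-k SGD), and a single Power-SGD-style power iteration for low-rank approximation. It uses PyTorch; the function names and the density/rank parameters are illustrative assumptions.

import torch

def sign_compress(grad: torch.Tensor) -> torch.Tensor:
    # Quantization (Sign-SGD): transmit only the sign of each gradient element.
    return torch.sign(grad)

def topk_compress(grad: torch.Tensor, density: float = 0.01):
    # Sparsification (Top-k SGD): keep the k largest-magnitude elements
    # and transmit them as (values, indices).
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def powersgd_compress(grad: torch.Tensor, q: torch.Tensor):
    # Low-rank (Power-SGD style): one power iteration approximating the
    # gradient matrix M with P @ Q^T, where P is (m, r) and Q is (n, r).
    m = grad.reshape(grad.shape[0], -1)   # view the gradient as a 2-D matrix M
    p = m @ q                             # P = M Q
    p, _ = torch.linalg.qr(p)             # orthogonalize P
    q_new = m.t() @ p                     # Q = M^T P
    return p, q_new, (p @ q_new.t()).reshape(grad.shape)

In Power-SGD, the low-rank factors P and Q are communicated instead of the full gradient; per the abstract, ACP-SGD alternately compresses and communicates these low-rank matrices, which is what restores compatibility with all-reduce, pipelining, and tensor fusion as in S-SGD.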