Academic Article

Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms.

Document Type

Article

Author

Cheng, Daning; Li, Shigang; Zhang, Hanping; Xia, Fen; Zhang, Yunquan

Source

IEEE Transactions on Parallel & Distributed Systems. Jul. 2021, Vol. 32, Issue 7, pp. 1702-1712. 11 p.

Subject

*Machine learning
*Parallel algorithms
*Mathematical optimization
*Random forest algorithms
*Support vector machines
*Algorithms

Language

ISSN

1045-9219

Abstract

As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralization optimization algorithm, and (4) the dual coordinate optimization algorithm (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms. [ABSTRACT FROM AUTHOR]
