Academic Paper

Fast Performance Prediction for Efficient Distributed DNN Training
Document Type
Periodical
Author
Source
IEEE Computer Architecture Letters, 22(2):133-136, Dec. 2023
Subject
Computing and Processing
Optimization
Costs
Performance evaluation
Parallel processing
Training
Throughput
Tensors
Distributed training
performance modeling
large language model
3D parallelism
Language
English
ISSN
1556-6056
1556-6064
2473-2575
Abstract
Training large-scale DNN models requires parallel distributed training on hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the number of possible combinations of schemes becomes enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically infeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, enabling efficient and fast searches for the optimal parallel plan even when resources are limited. Notably, this work is pioneering in explaining the expensive nature of the search for an optimal plan and in addressing it with intuitive performance estimations based on real-device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process by up to 126x (36.4x on average) over the existing state-of-the-art baseline, Alpa.
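
To make the abstract's idea concrete, the sketch below shows, in Python, how a throughput cost model can guide a brute-force search over 3D-parallel plans (data, tensor, and pipeline parallelism degrees). It is a minimal illustration under stated assumptions: the cost function, the device count, and all names (candidate_plans, estimated_step_time, best_plan, comm_cost) are placeholders invented here, not the paper's Markovian Performance Estimator or any real library API.

# Illustrative sketch only: cost-model-guided search over 3D-parallel plans.
# The analytic cost function is an assumption for demonstration, NOT the MPE.
from itertools import product

def candidate_plans(num_devices):
    """Yield (data, tensor, pipeline) parallel degrees whose product uses all devices."""
    for dp, tp, pp in product(range(1, num_devices + 1), repeat=3):
        if dp * tp * pp == num_devices:
            yield dp, tp, pp

def estimated_step_time(dp, tp, pp, compute_time=1.0, comm_cost=0.05):
    """Placeholder cost model (assumption): compute scales down with parallelism,
    while each parallelism dimension adds its own communication or bubble overhead."""
    compute = compute_time / (dp * tp * pp)      # ideal compute scaling
    grad_sync = comm_cost * (dp - 1)             # data-parallel gradient all-reduce
    tensor_comm = 2 * comm_cost * (tp - 1)       # tensor-parallel collectives
    pipeline_bubble = comm_cost * (pp - 1)       # pipeline warm-up/drain bubble
    return compute + grad_sync + tensor_comm + pipeline_bubble

def best_plan(num_devices):
    """Return the plan with the lowest estimated step time (highest throughput)."""
    return min(candidate_plans(num_devices),
               key=lambda plan: estimated_step_time(*plan))

if __name__ == "__main__":
    dp, tp, pp = best_plan(8)
    print(f"best plan for 8 devices: data={dp}, tensor={tp}, pipeline={pp}")

The point of such a model is that each candidate plan is scored analytically instead of being profiled on hardware, so the search over all valid degree combinations stays cheap; the paper's contribution lies in making these estimates accurate by grounding them in real-device measurements.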