학술논문
Fast Performance Prediction for Efficient Distributed DNN Training
Document Type
Periodical
Source
IEEE Computer Architecture Letters IEEE Comput. Arch. Lett. Computer Architecture Letters. 22(2):133-136 Dec, 2023
Subject
Language
ISSN
1556-6056
1556-6064
2473-2575
1556-6064
2473-2575
Abstract
Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process up to 126x faster (36.4 on average) than the existing state-of-the-art baseline, Alpa.