Academic Paper

Benchmarking Resource Usage for Efficient Distributed Deep Learning
Document Type
Conference
Source
2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-8, Sep. 2022
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Training
Deep learning
Energy consumption
Computer vision
Computational modeling
High performance computing
Neural networks
Language
English
ISSN
2643-1971
Abstract
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and their training leverage increasing compute and energy resources, especially specialized, computationally intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains and tasks (natural language processing, computer vision, and chemistry) on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power-utilization and GPU clock-rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource- and energy-constrained regimes. We fit power-law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization by selectively reducing energy consumption for different deep learning tasks and workflows with minimal impact on training.
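
The abstract mentions fitting power-law models of training time versus available compute. A minimal sketch of such a fit in Python is given below; the GPU counts and training times are hypothetical placeholders, not measurements from the paper, and the exact functional form and fitting procedure used by the authors may differ.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (GPU count, training time in hours) pairs; illustrative only,
# not data reported in the paper.
gpus = np.array([8, 16, 32, 64, 128, 256, 424], dtype=float)
hours = np.array([40.0, 22.0, 12.5, 7.4, 4.6, 3.1, 2.4])

def power_law(n, a, b):
    # Training time modeled as T(n) = a * n**(-b), i.e. a power law in GPU count.
    return a * n ** (-b)

# Least-squares fit of the power-law parameters to the measurements.
(a, b), _ = curve_fit(power_law, gpus, hours, p0=(100.0, 0.8))
print(f"Fitted model: T(n) ~ {a:.1f} * n^(-{b:.2f})")

The same form extends naturally to energy-constrained runs by fitting separate (a, b) parameters per power-cap or clock-rate setting and comparing the exponents.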