Academic Paper

Ultron-AutoML: an open-source, distributed, scalable framework for efficient hyper-parameter optimization
Document Type
Conference
Source
2020 IEEE International Conference on Big Data (Big Data), pp. 1584-1593, Dec. 2020
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Engineering Profession
Geoscience
Signal Processing and Analysis
Training
Computational modeling
Big Data
Task analysis
Optimization
Open source software
Monitoring
Hyperparameter optimization
Machine Learning
Deep Unsupervised/Semi-supervised/Representation/Self-supervised Learning
Language
English
Abstract
We present Ultron-AutoML, an open-source, distributed framework for efficient hyper-parameter optimization (HPO) of ML models. Because hyper-parameter optimization is compute-intensive and time-consuming, the framework is designed for reliability (the ability to successfully complete an HPO job in a multi-tenant, failure-prone environment) as well as efficiency (completing the job with minimum compute cost and wall-clock time). From a user's perspective, the framework emphasizes ease of use and customizability. The user can declaratively specify and execute an HPO job, while ancillary tasks (containerizing and running the user's scripts, model checkpointing, monitoring progress, parallelization) are handled by the framework. At the same time, the user has complete flexibility in composing the code-base that specifies the ML model training algorithm and, optionally, any custom HPO algorithm. The framework supports the creation of data pipelines that stream batches of shuffled and augmented data from a distributed file system, which is useful for training Deep Learning models with self-supervised, semi-supervised, or representation learning algorithms over large training datasets. We demonstrate the framework's reliability and efficiency by running a BERT pre-training job over a large training corpus on pre-emptible GPU compute targets. Despite the inherent unreliability of the underlying compute nodes, the framework completes such long-running jobs at 30% of the cost, with only a marginal increase in wall-clock time. The framework also includes a service to monitor jobs and ensure the reproducibility of any result.
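The abstract's central idea is a declarative HPO job: the user states the search space, budget, and objective, and the framework handles trial execution. The sketch below illustrates that idea only; the `job_spec` field names, the `train` stand-in, and the random-search runner are hypothetical and are not Ultron-AutoML's actual API.

```python
import random

# Hypothetical, simplified sketch of a declarative HPO job of the kind the
# abstract describes. None of these names come from Ultron-AutoML itself.
job_spec = {
    "search_space": {
        "learning_rate": [1e-4, 3e-4, 1e-3],
        "batch_size": [16, 32, 64],
    },
    "max_trials": 8,
    "objective": "minimize",
}

def train(params):
    # Stand-in for the user's training script; returns a validation loss.
    return params["learning_rate"] * 100 + 1.0 / params["batch_size"]

def run_hpo(spec, seed=0):
    """Random search over the declared space; the lowest-loss trial wins."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(spec["max_trials"]):
        params = {k: rng.choice(v) for k, v in spec["search_space"].items()}
        loss = train(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = run_hpo(job_spec)
```

In the framework described by the paper, the runner would additionally containerize the training script, checkpoint models, and schedule trials in parallel on (possibly pre-emptible) compute nodes; the loop above captures only the declarative search itself.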