Academic Paper

Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing
Document Type
Conference
Source
2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), pp. 442-449, Dec. 2021
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Training
Deep learning
Processor scheduling
Error analysis
Computational modeling
Conferences
Neural networks
Job Scheduling
Reinforcement Learning
High-Performance Computing
Language
ISSN
2690-5965
Abstract
Job scheduling is crucial in high-performance computing (HPC): the scheduler decides which jobs are admitted to the system and when, and on which resources each job is placed, while balancing multiple scheduling goals. With the increase in the variety of resources and in deep learning training (DLT) workloads, job failure has become a common issue in HPC, and it degrades both user satisfaction and cluster utilization. To alleviate the influence of hardware and software errors as much as possible, this paper tackles the problem of failure-aware job scheduling in HPC clusters. Inspired by the success of previous studies on deep reinforcement learning-driven job scheduling, we propose a novel HPC scheduling agent named FARS (Failure-aware RL-based scheduler) that takes the effects of job failures into account. First, a neural network maps raw cluster and job state information to job placement decisions. Second, to capture the influence of job failures on user satisfaction and cluster utilization, FARS uses the make-span of the entire workload as the training objective. Additionally, effective exploration and experience replay techniques are applied to obtain a well-converged agent. To evaluate FARS, we design extensive trace-based simulation experiments with popular DLT workloads. The experimental results show that, compared with the best baseline model, FARS achieves a 5.69% improvement in average make-span under different device error rates. Taken together, FARS is an ideal candidate for a failure-aware job scheduler in HPC clusters.
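The abstract names three ingredients (exploration, experience replay, and a make-span training objective) without giving implementation details. The Python sketch below only illustrates what such pieces commonly look like in a reinforcement-learning scheduler; it is not the FARS implementation. All names (`ReplayBuffer`, `epsilon_greedy`, `terminal_reward`) are hypothetical, epsilon-greedy is an assumed stand-in for the unspecified exploration technique, and the negative-make-span reward is an assumed formulation rather than one taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(scores, epsilon, rng=random):
    """Pick a random placement with probability epsilon, else the best-scoring one.

    `scores` stands in for the per-placement outputs of a neural network.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])


def makespan(finish_times):
    """Make-span of a workload: the time at which its last job finishes."""
    return max(finish_times)


def terminal_reward(finish_times):
    # Assumed reward shape: negative make-span, so that maximizing reward
    # minimizes the completion time of the whole workload. Time lost to
    # failed and re-run jobs would show up as later finish times.
    return -makespan(finish_times)
```

For example, `epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)` always selects index 1, while a higher `epsilon` occasionally tries other placements so that the agent keeps observing their outcomes.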