Academic Paper

Online Scheduling with Redirection for Parallel Jobs
Document Type
Conference
Source
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1-4, May 2020
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Program processors
Resource management
Scheduling algorithms
Clustering algorithms
Scheduling
Parallel jobs
redirection
Abstract
An important component of High Performance Computing (HPC) clusters is the job scheduling algorithm, which decides how jobs are allocated and scheduled in the system. Such scheduling algorithms need to scale to cope with the growth in both the size and the complexity of modern clusters. In this paper, we propose a new algorithm for scheduling parallel jobs with redirection. Specifically, our algorithm redirects jobs whose execution significantly affects a large number of other jobs. A redirected job is stopped and restarted from the beginning in a dedicated part of the cluster. We show the effectiveness of our method through an extensive experimental campaign of simulations based on production cluster log traces.
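The redirection idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the record does not say how the impact of a job on other jobs is measured or how the dedicated part of the cluster is sized. The sketch assumes a hypothetical rule in which a running job is redirected when it blocks at least `threshold` waiting jobs and the dedicated partition has room for it. All names (`Job`, `blocked_jobs`, `should_redirect`, `redirect`, `threshold`) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class Job:
    name: str
    procs: int        # processors requested
    runtime: float    # full run time of the job
    remaining: float  # run time still needed in its current placement


def blocked_jobs(running, waiting, free_procs):
    """Waiting jobs that do not fit in the free processors now,
    but would fit if `running` released its processors (assumed impact measure)."""
    return [w for w in waiting
            if free_procs < w.procs <= free_procs + running.procs]


def should_redirect(running, waiting, free_procs, dedicated_free, threshold):
    """Hypothetical rule: redirect when the job blocks at least `threshold`
    waiting jobs and fits in the free part of the dedicated partition."""
    if running.procs > dedicated_free:
        return False
    return len(blocked_jobs(running, waiting, free_procs)) >= threshold


def redirect(job):
    """A redirected job is stopped and restarted from the beginning,
    so all progress is lost and the full run time remains."""
    return replace(job, remaining=job.runtime)


# Usage: one large running job, three waiting jobs, two free processors.
big = Job("big", procs=8, runtime=100.0, remaining=50.0)
queue = [Job("a", 4, 10.0, 10.0), Job("b", 6, 5.0, 5.0), Job("c", 2, 1.0, 1.0)]
if should_redirect(big, queue, free_procs=2, dedicated_free=8, threshold=2):
    big = redirect(big)  # restarts from zero in the dedicated partition
```

The sketch makes the cost of redirection explicit: the restarted job loses the work it has already done, so the rule only pays off when enough waiting jobs benefit from the released processors.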