Description
Use Case Summary
It would be helpful if there were a mechanism by which a scheduler could negotiate an increase or decrease in the number of nodes available to a job. This is essentially a request for what is sometimes called malleable scheduling. It would require changes to both the scheduler and the workflow system (or any other malleable job), as well as a standard way of negotiating the changes.
Use Case Details
To support "on-demand" computing, we intend to designate a subset of our resources as eligible for preemption to make room for "on-demand" or "deadline-driven" jobs. An example would be a light source that takes a dataset, sends it to our facility for analysis, and wants the results back ASAP, certainly not after the job has sat in a queue for hours. We have chosen to evaluate the preemption path to try to minimize "wasted" cycles: if we simply dedicated nodes to these jobs, the nodes might sit idle for a significant fraction of the time (how much time is wasted obviously depends on the use case).
To minimize the impact of preemption, it would be ideal to preferentially schedule small, short jobs on the resources designated for preemption. We could then kill the minimum number of jobs needed to free the required resources (ten one-node jobs to get 10 nodes, rather than one 100-node job) and lose a minimum amount of computation (if the jobs are only minutes long, we lose at most minutes of computation). However, in many cases those small, short jobs are not visible to the scheduler, because the workflow system has submitted a single job to "provision" a larger set of resources for a longer period of time and then runs many smaller jobs within those resources (e.g. a Condor "glide-in"). This means that today, when preempting, the scheduler would be forced to kill the entire job, even if it needed only a small fraction of the nodes that job was using.
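To make the cost difference concrete, here is a minimal sketch of how a scheduler might pick preemption victims. It is only an illustration under stated assumptions: the `Job` record, the `pick_victims` function, and the "least elapsed time first" greedy rule are all hypothetical and are not taken from any existing scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical view of a running, preemptible job."""
    job_id: str
    nodes: int
    elapsed_min: float  # minutes of work that would be lost if killed

    @property
    def lost_node_min(self) -> float:
        # Computation thrown away by killing this job, in node-minutes.
        return self.nodes * self.elapsed_min


def pick_victims(jobs, nodes_needed):
    """Greedy heuristic (an assumption, not an optimal solver):
    kill the jobs with the least elapsed time first until enough
    nodes are freed. Returns [] if the request cannot be satisfied."""
    victims, freed = [], 0
    for job in sorted(jobs, key=lambda j: (j.elapsed_min, j.nodes)):
        if freed >= nodes_needed:
            break
        victims.append(job)
        freed += job.nodes
    return victims if freed >= nodes_needed else []


# Case 1: the scheduler can see ten one-node jobs (5 minutes old each)
# alongside a 100-node job that has been running for two hours.
small = [Job(f"small-{i}", nodes=1, elapsed_min=5.0) for i in range(10)]
big = Job("glide-in", nodes=100, elapsed_min=120.0)

visible = pick_victims(small + [big], nodes_needed=10)
print(len(visible), sum(j.lost_node_min for j in visible))

# Case 2: the small jobs are hidden inside the glide-in, so the only
# candidate is the whole 100-node job.
hidden = pick_victims([big], nodes_needed=10)
print(len(hidden), sum(j.lost_node_min for j in hidden))
```

In this toy example, freeing 10 nodes costs 50 node-minutes when the small jobs are visible, versus 12000 node-minutes when only the enclosing 100-node job can be killed.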
A use case involving an increase in the number of nodes is also potentially valuable. As a scheduler, if I know that I have a "drain window" (nodes that I expect to be empty for a period of time because I have no job in the queue that can fit until another job ends), I could offer those nodes to the workflow system to use temporarily.
If there were a mechanism that allowed a job (the workflow system) to designate itself as "malleable" to the scheduler, and then a mechanism for negotiating changes to the number of nodes available to the job, we could potentially eliminate a lot of wasted cycles.
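As a starting point for discussion, the negotiation could look something like the sketch below. Everything in it is an assumption made for illustration: the `MalleableJob` class, the `min_nodes`/`max_nodes` bounds, and the `shrink`/`grow` calls are hypothetical names, not an existing scheduler or workflow-system API. The idea is that the job declares the range of node counts it can tolerate, and the scheduler's requests are answered with the number of nodes the job actually releases or accepts.

```python
from dataclasses import dataclass


@dataclass
class MalleableJob:
    """Hypothetical job that has declared itself malleable."""
    job_id: str
    min_nodes: int  # the job cannot make progress below this size
    max_nodes: int  # the job cannot use more than this many nodes
    nodes: int      # current allocation

    def shrink(self, requested: int) -> int:
        """Scheduler asks for `requested` nodes back. The job releases
        as many as it can without dropping below min_nodes and
        returns the number actually released."""
        released = max(0, min(requested, self.nodes - self.min_nodes))
        self.nodes -= released
        return released

    def grow(self, offered: int) -> int:
        """Scheduler offers `offered` extra nodes (for example, from a
        drain window). The job accepts up to max_nodes and returns
        the number actually accepted."""
        accepted = max(0, min(offered, self.max_nodes - self.nodes))
        self.nodes += accepted
        return accepted


job = MalleableJob("glide-in", min_nodes=20, max_nodes=120, nodes=100)

released = job.shrink(10)  # on-demand job needs 10 nodes
print(released, job.nodes)

accepted = job.grow(50)    # scheduler offers 50 nodes from a drain window
print(accepted, job.nodes)
```

A real protocol would also need things this sketch leaves out, such as a deadline by which offered nodes must be returned and a fallback to killing the job if it does not respond to a shrink request.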