WIP: support resource.reserve configuration parameter #6547
Conversation
This seems workable to me. I can't really think of a case where a system instance would want to reserve different cores on different node types and run into awkwardness specifying those as ranks. If that ever does come up then hostlist support might be nice.
It would be good to get feedback from @ryanday36 on whether this is the correct approach for what's needed in the near term. I'm not sure if the requirements are to allow users to reserve cores for the system in their batch jobs, with the default being not to do so, or if the need is to reserve cores in the system instance and then give users an option to opt out. I'd hate to add a new config key (and a whole new way to select resources) if it isn't going to end up useful.
My understanding of what we want with the "OS" or "system" cores is to have no user tasks run on the system cores unless the user explicitly asks for them. So, this
In this case the resources are excluded from the scheduler, so there is no way to ask for them, except perhaps by launching a new instance (i.e. via either ...). @garlick's idea of a partition might be slightly more suitable, except that, AIUI, the proposed partitions are not allowed to overlap, so you would not be able to run a job across all partitions. That may be easily resolved, however. In any event, I think this requirement implies that the scheduler needs to know about reserved resources specifically, so withholding them like
Would that be all that bad? I think we decided the partition idea was going to take significantly more effort.
Having users get a new instance with
Well, perhaps this is ready for a review then. Is the current interface ok? Happy to take suggestions on improvements.
Problem: rlist_copy_internal() is declared static, which means all copy implementations must be defined directly in rlist.c. Expose rlist_copy_internal() in rlist_private.h to allow copy implementations to be placed in separate source files.
Problem: There is no way to copy a set of cores from an rlist, but this would be useful when creating a set of cores to reserve for the OS. Add a new function rlist_copy_core_spec() which can copy cores from an rlist object using a specification of the form `cores[@ranks] [cores[@ranks]]...`, where `cores` and the optional `ranks` are idsets of the cores/ranks to include in the copy.
Problem: There are no unit tests of the new function rlist_copy_core_spec(). Add unit tests.
Problem: It would be useful to have an rlist diff routine that modifies the rlist argument, but this code is embedded in rlist_diff(), which returns a new rlist. Split rlist_subtract() out of rlist_diff() and make it public. Have rlist_diff() call rlist_subtract() internally.
Problem: There is no way to reserve cores so they are not used for scheduling. Add a 'reserve' subsystem to the core resource module that takes a reserved set of cores via the `cores[@ranks]` spec. Nothing is done with the reserve subsystem at this point.
Problem: The `rlist` output format has `+` and `:` transposed, which results in an extraneous `+` in the output. Fix the format.
Problem: There is no way to reserve a set of cores so they cannot be used by the scheduler of a Flux instance. Add support for a new config key `resource.reserve`, which takes a string of the form `cores[@ranks]` where `cores` and the optional `ranks` are idsets specifying the cores to reserve and the ranks on which to reserve them. If ranks is not specified, then the spec applies to all ranks. The reserved set of cores is subtracted from the resource set before it is handed off to the scheduler. The reserved resource set is also removed from the status response used by `flux resource list`. (See the example configuration below.)
Problem: The `reserve` key in the `[resource]` table is not documented. Document it.
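A minimal sketch of how the new key from the commit messages above might appear in an instance config file (TOML; the reserved core value is illustrative, not taken from this PR):

```toml
# Reserve core 0 on every rank so it is withheld from the scheduler and
# removed from the resource set reported by `flux resource list`.
[resource]
reserve = "0"
```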
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6547 +/- ##
==========================================
- Coverage 79.47% 79.37% -0.11%
==========================================
Files 531 533 +2
Lines 88433 88573 +140
==========================================
+ Hits 70282 70302 +20
- Misses 18151 18271 +120
garlick left a comment
Is it feasible to add a high-level test in sharness for this?
This WIP PR addresses #6315 and #5240 by adding a new config key `resource.reserve`, which allows specification of a set of cores to hold back from the scheduler, along with optional ranks on which to reserve them. Cores are specified in the form `cores[@ranks]`, where `cores` and the optional `ranks` are idsets. If ranks is not specified, then the same cores are reserved on all ranks. To reserve different sets of cores on different ranks, multiple specifications may be used, separated by whitespace. For example, `0` would withhold core 0 on all ranks, `0-1@0 0` would reserve cores 0-1 on rank 0 and core 0 on all other ranks, etc. (An example configuration is sketched below.)

Marking this as a WIP because it is just one possible approach. We could also easily extend the `resource.exclude` config key to allow a similar syntax. There are possibly some drawbacks here:

- `exclude` is encoded as an idset in the `resource.status` response - we'd have to always condense the excluded resource set into an idset for backwards compatibility.
- `exclude` allows specification of a hostlist, which doesn't make as much sense for resource reservation (though it could be easily handled).
- `resource.exclude` is perhaps different than `resource.reserve` -- e.g. reserving the same set of cores across all nodes of a job would be slightly more awkward if only `resource.exclude` syntax was supported (since specification of ranks would be obligatory).

However, it does seem awkward to have two config keys that do similar things.
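For concreteness, here is a minimal sketch of the proposed TOML configuration using the example values above (key name and spec syntax as described in this PR; treat the exact values as illustrative, not a settled interface):

```toml
[resource]
# Reserve cores 0-1 on rank 0 and core 0 on all other ranks.
# Multiple cores[@ranks] specifications are separated by whitespace.
reserve = "0-1@0 0"
```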
The other question here is how to present the reserved resources in either `flux resource status` (which currently does not include sub-rank resources in default output anyway) or `flux resource list`.