delegate: allow job delegation to a different flux instance #6873
base: master
              "%" JSON_INTEGER_FORMAT
              ": submission to specified Flux instance failed",
              *orig_id);
    flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0, errstr);
Check failure (Code scanning / CodeQL): Non-constant format string, severity Critical, at the call to flux_jobtap_raise_exception.
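A common way to address this class of finding is to pass a constant "%s" format and supply the prebuilt message as an argument, so that any '%' characters in the message are never interpreted as conversions. The following is a minimal sketch of that pattern, not the PR's actual code. It assumes flux_jobtap_raise_exception takes a printf-style format followed by variadic arguments; raise_exception_note and report_delegation_failure are hypothetical stand-ins so the example compiles without Flux headers.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a printf-style API such as
 * flux_jobtap_raise_exception(): the last fixed argument is a format
 * string, followed by variadic arguments. The formatted note is written
 * to 'out' so the result can be inspected. */
static int raise_exception_note (char *out, size_t len, const char *fmt, ...)
{
    va_list ap;
    va_start (ap, fmt);
    int n = vsnprintf (out, len, fmt, ap);
    va_end (ap);
    return n;
}

/* Hypothetical helper: build the error text once, then hand it over as
 * an argument to a constant "%s" format instead of as the format itself. */
static int report_delegation_failure (char *out, size_t len, long long id)
{
    char errstr[128];
    snprintf (errstr,
              sizeof (errstr),
              "%lld: submission to specified Flux instance failed",
              id);
    /* Flagged pattern:  raise_exception_note (out, len, errstr);     */
    /* Safer pattern:    the format is a string literal.              */
    return raise_exception_note (out, len, "%s", errstr);
}
```

Under the same assumption about the signature, the flagged call would become `flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0, "%s", errstr);`.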
Hi @trws @garlick @grondo:
Thank you for the help!
Problem: Delegating a submitted job to a different flux instance is currently not supported. This feature is useful for users whose jobs need to be executed in a multi-cluster setup, or for workflows where a certain sub-job needs to be executed on a peer cluster.
Allow delegation of jobs to a different, peer-level flux instance by utilizing the URI of the other instance. Currently, no explicit check is performed that the job is feasible on the delegated instance, and it is assumed that both instances belong to the same user.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6873      +/-   ##
==========================================
- Coverage   83.85%   83.84%   -0.02%
==========================================
  Files         538      538
  Lines       89912    89912
==========================================
- Hits        75399    75383      -16
- Misses      14513    14529      +16
This is pretty neat. However, it seems like its usefulness will be limited as long as the result of the proxy job is always reported as failure. I see why a fatal exception was used to get the job from DEPEND to CLEANUP state in the prototype, since there's no other way to do that currently, but I wonder if we can think up a better way that allows the proxy job result to reflect the result of the actual job. That way workflow tools, not to mention Flux native dependency schemes like
We currently have a special alloc_bypass flag in the job manager that allows the
Also, I wonder if we can find another way to pass in the instance URI, since embedding it in the dependency URI might cause parsing problems later on (like ambiguity about which scheme an appended
The job-exec module already supports an "override" mode where posting of
Aniruddha and I are looking into this, so we can post the events between
Let us know if we're on the right track and if we're missing something. One question we had was about the
We looked at this example of updating the urgency but it wasn't clear if that's the right approach. Thank you for the help!
@garlick We could use a
You can probably ignore priority for now and let the job manager
Maybe delegation could stake out its own namespace in the jobspec
Sorry, I kind of lost my cache on this one!
Will you hold the job in DEPEND state while the other instance blocks in SCHED state waiting for resources? If so, that's before PRIORITY state, so my comment about the priority translating over there makes no sense and I see now why you were asking. I think you would post the
I think you'll need to commit an R object to the KVS at this time, or there may be fallout from
I'll let @grondo address those two as I'm less familiar with the job-exec override. Just a note that these events are posted to the
Oh wait, I forgot that
I haven't fully read through the existing code to see where we're at, but here are some quick thoughts based on the conversation above:
Make sure you set the
One possible solution to store the delegated instance URI would be to use a
{"timestamp":1756238516.1369765,"name":"memo","context":{"delegated":"ssh://clusterX/run/flux/local"}}
Tools could pick this up from the job annotations as
Job execution override is handled by the job-exec module, which waits for
When the job-exec module receives the
Since the
Agreed. Since your plugin will be posting the
Now whether you hold the job in DEPEND state or SCHED state is up to you. It does seem like it could work either way if you always set the
Problem: Delegating a submitted job to a different flux instance is currently not supported. This feature is useful for users whose jobs need to be executed in a multi-cluster setup, or for workflows where a certain sub-job needs to be executed on a peer cluster.
Allow delegation of jobs to a different, peer-level flux instance by utilizing the URI of the other instance. Currently, no explicit check is performed that the job is feasible on the delegated instance, and it is assumed that both instances belong to the same user.
Co-authored by: James Corbett