
Conversation

@tpatki (Member) commented Jun 7, 2025

Problem: Delegating a submitted job to a different Flux instance is currently not supported. This feature is useful for users whose jobs need to be executed in a multi-cluster setup, or for workflows in which a certain sub-job needs to be executed on a peer cluster.

Allow delegation of jobs to a different, peer-level Flux instance by utilizing the URI of the other instance. Currently, no explicit feasibility check for the job is performed on the delegated instance, and it is assumed that both instances belong to the same user.

Co-authored-by: James Corbett

"%" JSON_INTEGER_FORMAT
": submission to specified Flux instance failed",
*orig_id);
flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0, errstr);

Check failure (Code scanning / CodeQL): Non-constant format string (Critical)

The format string argument to flux_jobtap_raise_exception has a source which cannot be verified to originate from a string literal.
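
The usual remediation for this warning is to route the variable message through a literal "%s" format so the format argument itself is a string constant; a minimal sketch against the excerpt above (p, orig_id, and errstr as in the flagged code):

  /* Pass errstr as a "%s" argument so the format string is a
   * literal, satisfying the CodeQL check. */
  flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0,
                               "%s", errstr);
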
@tpatki (Member, Author) commented Jun 7, 2025

Hi @trws @garlick @grondo:
A couple of questions as I work on polishing this:

Thank you for the help!

@tpatki changed the title from "delegate: allow delegation of a job to a different flux instance" to "delegate: allow job delegation to a different flux instance" on Jun 7, 2025
@tpatki force-pushed the jobtap_delegate branch from 770ffcb to 4c42b36 on June 7, 2025 02:26
@tpatki force-pushed the jobtap_delegate branch from 4c42b36 to c9d648b on June 7, 2025 02:32
@tpatki marked this pull request as draft on June 7, 2025 02:33
@codecov bot commented Jun 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.84%. Comparing base (e699349) to head (c9d648b).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6873      +/-   ##
==========================================
- Coverage   83.85%   83.84%   -0.02%     
==========================================
  Files         538      538              
  Lines       89912    89912              
==========================================
- Hits        75399    75383      -16     
- Misses      14513    14529      +16     

see 6 files with indirect coverage changes


@garlick (Member) commented Jun 14, 2025

This is pretty neat. However, it seems like its usefulness will be limited as long as the result of the proxy job is always reported as failure. I see why a fatal exception was used to get the job from DEPEND to CLEANUP state in the prototype, since there's currently no other way to do that, but I wonder if we can think up a better way that allows the proxy job result to reflect the result of the actual job. That way workflow tools, not to mention Flux-native dependency schemes like afterok, would not need to treat delegated jobs specially when they need to fetch the job result.

We currently have a special alloc_bypass flag in the job manager that allows the alloc-bypass jobtap plugin to skip scheduling and post the alloc and free events on the scheduler's behalf. I wonder if we could add an exec_bypass flag and let this plugin use both flags and post alloc (adding a fake or empty R), free, start, and finish (with job result) and have the proxy job track the state of the actual job?
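
For concreteness, a rough sketch of that combination in a jobtap callback, using the existing flux_jobtap_job_set_flag and flux_jobtap_event_post_pack calls; the "exec-bypass" flag is hypothetical (only "alloc-bypass" exists in the job manager today), so this is an illustration of the idea, not working code:

  /* Hypothetical sketch: bypass both the scheduler and the execution
   * system, then post the events on their behalf as the remote job
   * progresses.  "exec-bypass" is the proposed flag and does not
   * exist yet. */
  static int depend_cb (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *arg)
  {
      if (flux_jobtap_job_set_flag (p, FLUX_JOBTAP_CURRENT_JOB,
                                    "alloc-bypass") < 0
          || flux_jobtap_job_set_flag (p, FLUX_JOBTAP_CURRENT_JOB,
                                       "exec-bypass") < 0) /* proposed */
          return -1;
      /* Later: post alloc (with a fake or empty R), then start,
       * finish (with the remote job result), and free, e.g.: */
      return flux_jobtap_event_post_pack (p, FLUX_JOBTAP_CURRENT_JOB,
                                          "alloc", NULL);
  }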

Also, I wonder if we can find another way to pass in the instance URI, since embedding it in the dependency URI might cause parsing problems later on (like ambiguity about which scheme an appended ?option=value belongs to)?

@grondo (Contributor) commented Jun 14, 2025

I wonder if we could add an exec_bypass flag and let this plugin use both flags and post alloc (adding a fake or empty R), free, start, and finish (with job result) and have the proxy job track the state of the actual job?

The job-exec module already supports an "override" mode where posting of the start and finish events is handed off to another entity (which can post them via RPC to the job-exec module). See flux job-exec-override and the associated tests for examples.

@tpatki (Member, Author) commented Sep 12, 2025

Hi @garlick @grondo @trws

Aniruddha and I are looking into this so that we can post the events between DEPEND and CLEANUP. Our current understanding is as follows (a rough sketch of step 2 in C follows the list).

  1. We post an alloc event similar to this example from the alloc_bypass plugin.
  2. We use job-exec.override with flux_rpc (the C version) and post a start and then a finish event with payloads similar to cmd/flux-job-exec-override.py.
  3. We post a free event, similar to the alloc event, by replacing the "alloc" with "free".
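
A rough sketch of step 2 in C, under the assumption that the RPC payload mirrors what cmd/flux-job-exec-override.py sends; the "id", "event", and "status" keys are our reading of that script rather than a verified interface, and id/remote_status here stand for the proxy job's id and the delegated job's wait status:

  /* Sketch (unverified payload shape): hand the start override to
   * job-exec, then, once the delegated job completes, the finish
   * override carrying its wait status. */
  flux_t *h = flux_jobtap_get_flux (p);
  flux_future_t *f;

  if (!(f = flux_rpc_pack (h, "job-exec.override", FLUX_NODEID_ANY, 0,
                           "{s:I s:s}",
                           "id", (json_int_t) id,
                           "event", "start"))
      || flux_rpc_get (f, NULL) < 0)
      goto error;
  flux_future_destroy (f);

  /* ... after the delegated job completes ... */
  if (!(f = flux_rpc_pack (h, "job-exec.override", FLUX_NODEID_ANY, 0,
                           "{s:I s:s s:i}",
                           "id", (json_int_t) id,
                           "event", "finish",
                           "status", remote_status))
      || flux_rpc_get (f, NULL) < 0)
      goto error;
  flux_future_destroy (f);

(In a real plugin these would presumably be asynchronous, using flux_future_then continuations rather than blocking flux_rpc_get calls.)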

Let us know if we're on the right track and if we're missing something.

One question we had was about the priority event: do we need to explicitly post this as well? And if so, what should the default priority (or urgency) be? Is there an example of this?

We looked at this example of updating the urgency, but it wasn't clear whether that's the right approach.

Thank you for the help!

@tpatki (Member, Author) commented Sep 12, 2025

Also, I wonder if we can find another way to pass in the instance URI, since embedding it in the dependency URI might cause parsing problems later on (like ambiguity about which scheme an appended ?option=value belongs to)?

@garlick We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

@garlick (Member) commented Sep 13, 2025

One question we had was about the priority event: do we need to explicitly post this as well?

You can probably ignore priority for now and let the job manager priority-default plugin post it. In flux instances where flux-accounting is not overriding priority-default, the job priority is set to the urgency, which should translate fine across instances.

We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

Maybe delegation could stake out its own namespace in the jobspec attributes.system dict, and then a CLIPlugin could be written to set attributes.system.delegate.fluxuri or similar? These are new and I haven't tried them, but it looks like there are some nice examples under t/cli-plugins.
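
For illustration, with such a plugin (or even without one) the attribute could be set per job at submit time via --setattr; delegate.fluxuri is just the strawman name from above and mycommand is a placeholder:

  flux submit --setattr=system.delegate.fluxuri=ssh://clusterX/run/flux/local mycommand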

@garlick (Member) commented Sep 13, 2025

Sorry, I kind of lost my cache on this one!

  1. We post an alloc event similar to this example from the alloc_bypass plugin.

Will you hold the job in DEPEND state while the other instance blocks in SCHED state waiting for resources? If so, that's before PRIORITY state, so my comment about the priority translating over there makes no sense, and I see now why you were asking. I think you would post the depend event once the job is scheduled remotely, then let the local priority plugin post the priority (which is just to advance the state since it won't be used in scheduling), then upon reaching SCHED state, immediately post the alloc event.

I think you'll need to commit an R object to the KVS at this point, or there may be fallout from flux jobs and other tooling that expects it to be there.
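
A sketch of what that commit might look like from inside the plugin, assuming the usual libjob/libkvs calls (flux_job_kvs_key, flux_kvs_txn_put, flux_kvs_commit); the R body is a minimal illustrative stand-in, not a validated RFC 20 object:

  /* Write a placeholder R to the job's KVS directory so tooling that
   * looks up job.R finds something (content is illustrative only). */
  char key[64];
  flux_t *h = flux_jobtap_get_flux (p);
  flux_kvs_txn_t *txn = NULL;
  flux_future_t *f = NULL;

  if (flux_job_kvs_key (key, sizeof (key), id, "R") < 0
      || !(txn = flux_kvs_txn_create ())
      || flux_kvs_txn_put (txn, 0, key,
                           "{\"version\": 1, \"execution\": "
                           "{\"R_lite\": []}}") < 0
      || !(f = flux_kvs_commit (h, NULL, 0, txn)))
      goto error;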

  2. We use job-exec.override with flux_rpc (the C version) and post a start and then a finish event with payloads similar to cmd/flux-job-exec-override.py.
  3. We post a free event, similar to the alloc event, by replacing the "alloc" with "free".

I'll let @grondo address those two as I'm less familiar with job-exec-override. Just a note that these events are posted to the guest.exec.eventlog, not the main job eventlog.

@garlick (Member) commented Sep 13, 2025

Oh wait, I forgot that alloc-bypass has a special flag in the job manager that prevents the job from being queued for the local scheduler. So maybe it is feasible to hold the job in SCHED state instead of DEPEND?

@grondo (Contributor) commented Sep 15, 2025

I haven't fully read through the existing code to see where we're at, but here are some quick thoughts based on the conversation above:

Make sure you set the alloc-bypass flag before the job enters the SCHED state. This will prevent the alloc request from being sent to the local scheduler.

We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

One possible solution to store the delegated instance URI would be to use a memo event, for example:

  {"timestamp":1756238516.1369765,"name":"memo","context":{"delegated":"ssh://clusterX/run/flux/local"}}

Tools could pick this up from the job annotations as user.delegated, wait on this event being posted to see whether the job was delegated, etc.
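
From inside the plugin, posting that memo might look like the following, assuming a jobtap plugin may post memo events directly (worth verifying) and with uri standing for the delegated instance URI:

  /* Record the delegated instance URI as a memo so it surfaces in
   * the job annotations as user.delegated. */
  if (flux_jobtap_event_post_pack (p, FLUX_JOBTAP_CURRENT_JOB,
                                   "memo",
                                   "{s:s}",
                                   "delegated", uri) < 0)
      return -1;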

I'll let @grondo address those two as I'm less familiar with job-exec-override. Just a note that these events are posted to the guest.exec.eventlog, not the main job eventlog.

job execution override is handled by the job-exec module, which waits for start and finish events to be sent via an RPC instead of launching the job shells as would occur in a normal job. One potential gotcha here is that only instance owner jobs are allowed to use the job-exec override.

When the job-exec module receives the start RPC (as in the flux job-exec-override command), it sends an RFC 32 start response (which I believe does result in the start event in the main job eventlog). Similarly, the job-exec module waits for the 3rd party finish RPC to send the finish response (which results in the finish event posted to the job eventlog). These take the place of job-exec waiting for all job shells to be launched (start event) and for all job shells to complete (finish event).

Since the finish RPC takes a wait status parameter, this can be used to communicate whether the delegated job succeeded or failed. Once the finish event is posted, the job will proceed to CLEANUP (skipping per-rank epilog and housekeeping due to the alloc-bypass flag).
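
For reference, that status is a POSIX-style wait status: 0 means success, and a nonzero exit code is carried in the high byte. An illustrative encoding, with remote_exit_code as a placeholder for the delegated job's exit code:

  /* Encode the delegated job's exit code as a wait status for the
   * finish RPC (exit code occupies bits 8-15; 0 means success). */
  int status = remote_exit_code << 8;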

I think you would post the depend event once the job is scheduled remotely, then let the local priority plugin post the priority (which is just to advance the state since it won't be used in scheduling), then upon reaching SCHED state, immediately post the alloc event.

Agreed. Since your plugin will be posting the alloc event, the value of the job priority is inconsequential.

Now whether you hold the job in DEPEND state or SCHED state is up to you. It does seem like it could work either way if you always set the alloc-bypass flag before the depend event is posted.

