
Conversation

@tpatki (Member) commented Jun 7, 2025

Problem: Delegating a submitted job to a different Flux instance is currently not supported. This feature is useful for users whose jobs need to be executed in a multi-cluster setup, or for workflows in which a certain sub-job needs to be executed on a peer cluster.

Allow delegation of jobs to a different, peer-level Flux instance by utilizing the URI of the other instance. Currently, no explicit feasibility check for the job is performed on the delegated instance, and it is assumed that both instances belong to the same user.

Co-authored-by: James Corbett

"%" JSON_INTEGER_FORMAT
": submission to specified Flux instance failed",
*orig_id);
flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0, errstr);

Check failure (Code scanning / CodeQL): Non-constant format string (Critical)

The format string argument to flux_jobtap_raise_exception has a source which cannot be verified to originate from a string literal.
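
The usual remediation for this warning is to route the variable message through a literal "%s" format so the format argument itself is a string constant; a minimal sketch against the excerpt above (p, orig_id, and errstr as in the flagged code):

  /* Pass errstr as a "%s" argument so the format string is a
   * literal, satisfying the CodeQL check. */
  flux_jobtap_raise_exception (p, *orig_id, "DelegationFailure", 0,
                               "%s", errstr);
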
@tpatki (Member, Author) commented Jun 7, 2025

Hi @trws @garlick @grondo:
A couple of questions as I work on polishing this:

Thank you for the help!

@tpatki changed the title from "delegate: allow delegation of a job to a different flux instance" to "delegate: allow job delegation to a different flux instance" on Jun 7, 2025
@tpatki force-pushed the jobtap_delegate branch from 770ffcb to 4c42b36 on June 7, 2025 02:26
@tpatki force-pushed the jobtap_delegate branch from 4c42b36 to c9d648b on June 7, 2025 02:32
@tpatki marked this pull request as draft on June 7, 2025 02:33
@codecov bot commented Jun 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.84%. Comparing base (e699349) to head (c9d648b).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6873      +/-   ##
==========================================
- Coverage   83.85%   83.84%   -0.02%     
==========================================
  Files         538      538              
  Lines       89912    89912              
==========================================
- Hits        75399    75383      -16     
- Misses      14513    14529      +16     

see 6 files with indirect coverage changes


@garlick (Member) commented Jun 14, 2025

This is pretty neat. However, it seems like its usefulness will be limited as long as the result of the proxy job is always reported as failure. I see why a fatal exception was used to get the job from DEPEND to CLEANUP state in the prototype, since there's currently no other way to do that, but I wonder if we can think up a better way that allows the proxy job result to reflect the result of the actual job. That way workflow tools, not to mention Flux-native dependency schemes like afterok, would not need to treat delegated jobs specially when they need to fetch the job result.

We currently have a special alloc_bypass flag in the job manager that allows the alloc-bypass jobtap plugin to skip scheduling and post the alloc and free events on the scheduler's behalf. I wonder if we could add an exec_bypass flag and let this plugin use both flags and post alloc (adding a fake or empty R), free, start, and finish (with job result) and have the proxy job track the state of the actual job?
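
For concreteness, a rough sketch of that combination in a jobtap callback, using the existing flux_jobtap_job_set_flag and flux_jobtap_event_post_pack calls; the "exec-bypass" flag is hypothetical (only "alloc-bypass" exists in the job manager today), so this is an illustration of the idea, not working code:

  /* Hypothetical sketch: bypass both the scheduler and the execution
   * system, then post the events on their behalf as the remote job
   * progresses.  "exec-bypass" is the proposed flag and does not
   * exist yet. */
  static int depend_cb (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *arg)
  {
      if (flux_jobtap_job_set_flag (p, FLUX_JOBTAP_CURRENT_JOB,
                                    "alloc-bypass") < 0
          || flux_jobtap_job_set_flag (p, FLUX_JOBTAP_CURRENT_JOB,
                                       "exec-bypass") < 0) /* proposed */
          return -1;
      /* Later: post alloc (with a fake or empty R), then start,
       * finish (with the remote job result), and free, e.g.: */
      return flux_jobtap_event_post_pack (p, FLUX_JOBTAP_CURRENT_JOB,
                                          "alloc", NULL);
  }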

Also, I wonder if we can find another way to pass in the instance URI, since embedding it in the dependency URI might cause parsing problems later on (like ambiguity about which scheme an appended ?option=value belongs to)?

@grondo (Contributor) commented Jun 14, 2025

I wonder if we could add an exec_bypass flag and let this plugin use both flags and post alloc (adding a fake or empty R), free, start, and finish (with job result) and have the proxy job track the state of the actual job?

The job-exec module already supports an "override" mode where posting of the start and finish events is handed off to another entity (which can post them via RPC to the job-exec module). See flux job-exec-override and the associated tests for examples.

@tpatki (Member, Author) commented Sep 12, 2025

Hi @garlick @grondo @trws

Aniruddha and I are looking into this so that we can post the events between DEPEND and CLEANUP. Our current understanding is as follows (a rough sketch of step 2 in C follows the list).

  1. We post an alloc event similar to this example from the alloc_bypass plugin.
  2. We use job-exec.override with flux_rpc (the C version) and post a start and then a finish event with payloads similar to cmd/flux-job-exec-override.py.
  3. We post a free event, similar to the alloc event, by replacing the "alloc" with "free".
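
A rough sketch of step 2 in C, under the assumption that the RPC payload mirrors what cmd/flux-job-exec-override.py sends; the "id", "event", and "status" keys are our reading of that script rather than a verified interface, and id/remote_status here stand for the proxy job's id and the delegated job's wait status:

  /* Sketch (unverified payload shape): hand the start override to
   * job-exec, then, once the delegated job completes, the finish
   * override carrying its wait status. */
  flux_t *h = flux_jobtap_get_flux (p);
  flux_future_t *f;

  if (!(f = flux_rpc_pack (h, "job-exec.override", FLUX_NODEID_ANY, 0,
                           "{s:I s:s}",
                           "id", (json_int_t) id,
                           "event", "start"))
      || flux_rpc_get (f, NULL) < 0)
      goto error;
  flux_future_destroy (f);

  /* ... after the delegated job completes ... */
  if (!(f = flux_rpc_pack (h, "job-exec.override", FLUX_NODEID_ANY, 0,
                           "{s:I s:s s:i}",
                           "id", (json_int_t) id,
                           "event", "finish",
                           "status", remote_status))
      || flux_rpc_get (f, NULL) < 0)
      goto error;
  flux_future_destroy (f);

(In a real plugin these would presumably be asynchronous, using flux_future_then continuations rather than blocking flux_rpc_get calls.)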

Let us know if we're on the right track and if we're missing something.

One question we had was about the priority event: do we need to explicitly post this as well? And if so, what should the default priority (or urgency) be? Is there an example of this?

We looked at this example of updating the urgency, but it wasn't clear whether that's the right approach.

Thank you for the help!

@tpatki (Member, Author) commented Sep 12, 2025

Also, I wonder if we can find another way to pass in the instance URI, since embedding it in the dependency URI might cause parsing problems later on (like ambiguity about which scheme an appended ?option=value belongs to)?

@garlick We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

@garlick (Member) commented Sep 13, 2025

One question we had was about the priority event: do we need to explicitly post this as well?

You can probably ignore priority for now and let the job manager priority-default plugin post it. In flux instances where flux-accounting is not overriding priority-default, the job priority is set to the urgency, which should translate fine across instances.

We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

Maybe delegation could stake out its own namespace in the jobspec attributes.system dict, and then a CLIPlugin could be written to set attributes.system.delegate.fluxuri or similar? These are new and I haven't tried them, but it looks like there are some nice examples under t/cli-plugins.
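
For illustration, with such a plugin (or even without one) the attribute could be set per job at submit time via --setattr; delegate.fluxuri is just the strawman name from above and mycommand is a placeholder:

  flux submit --setattr=system.delegate.fluxuri=ssh://clusterX/run/flux/local mycommand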

@garlick (Member) commented Sep 13, 2025

Sorry, I kind of lost my cache on this one!

  1. We post an alloc event similar to this example from the alloc_bypass plugin.

Will you hold the job in DEPEND state while the other instance blocks in SCHED state waiting for resources? If so, that's before PRIORITY state, so my comment about the priority translating over there makes no sense, and I see now why you were asking. I think you would post the depend event once the job is scheduled remotely, then let the local priority plugin post the priority (which is just to advance the state since it won't be used in scheduling), then upon reaching SCHED state, immediately post the alloc event.

I think you'll need to commit an R object to the KVS at this point, or there may be fallout from flux jobs and other tooling that expects it to be there.
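
A sketch of what that commit might look like from inside the plugin, assuming the usual libjob/libkvs calls (flux_job_kvs_key, flux_kvs_txn_put, flux_kvs_commit); the R body is a minimal illustrative stand-in, not a validated RFC 20 object:

  /* Write a placeholder R to the job's KVS directory so tooling that
   * looks up job.R finds something (content is illustrative only). */
  char key[64];
  flux_t *h = flux_jobtap_get_flux (p);
  flux_kvs_txn_t *txn = NULL;
  flux_future_t *f = NULL;

  if (flux_job_kvs_key (key, sizeof (key), id, "R") < 0
      || !(txn = flux_kvs_txn_create ())
      || flux_kvs_txn_put (txn, 0, key,
                           "{\"version\": 1, \"execution\": "
                           "{\"R_lite\": []}}") < 0
      || !(f = flux_kvs_commit (h, NULL, 0, txn)))
      goto error;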

  2. We use job-exec.override with flux_rpc (the C version) and post a start and then a finish event with payloads similar to cmd/flux-job-exec-override.py.
  3. We post a free event, similar to the alloc event, by replacing the "alloc" with "free".

I'll let @grondo address those two as I'm less familiar with job-exec-override. Just a note that these events are posted to the guest.exec.eventlog, not the main job eventlog.

@garlick (Member) commented Sep 13, 2025

Oh wait, I forgot that alloc-bypass has a special flag in the job manager that prevents the job from being queued for the local scheduler. So maybe it is feasible to hold the job in SCHED state instead of DEPEND?

@grondo (Contributor) commented Sep 15, 2025

I haven't fully read through the existing code to see where we're at, but here are some quick thoughts based on the conversation above:

Make sure you set the alloc-bypass flag before the job enters the SCHED state. This will prevent the alloc request from being sent to the local scheduler.

We could use a config file that has the URI that the plugin utilizes, but that will be static (same URI for all jobs submitted to that instance) and not on a per-job/per-flux-submit basis. Other ideas?

One possible solution to store the delegated instance URI would be to use a memo event, for example:

  {"timestamp":1756238516.1369765,"name":"memo","context":{"delegated":"ssh://clusterX/run/flux/local"}}

Tools could pick this up from the job annotations as user.delegated, wait on this event being posted to see whether the job was delegated, etc.
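
From inside the plugin, posting that memo might look like the following, assuming a jobtap plugin may post memo events directly (worth verifying) and with uri standing for the delegated instance URI:

  /* Record the delegated instance URI as a memo so it surfaces in
   * the job annotations as user.delegated. */
  if (flux_jobtap_event_post_pack (p, FLUX_JOBTAP_CURRENT_JOB,
                                   "memo",
                                   "{s:s}",
                                   "delegated", uri) < 0)
      return -1;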

I'll let @grondo address those two as I'm less familiar with job-exec-override. Just a note that these events are posted to the guest.exec.eventlog, not the main job eventlog.

job execution override is handled by the job-exec module, which waits for start and finish events to be sent via an RPC instead of launching the job shells as would occur in a normal job. One potential gotcha here is that only instance owner jobs are allowed to use the job-exec override.

When the job-exec module receives the start RPC (as in the flux job-exec-override command), it sends an RFC 32 start response (which I believe does result in the start event in the main job eventlog). Similarly, the job-exec module waits for the 3rd party finish RPC to send the finish response (which results in the finish event posted to the job eventlog). These take the place of job-exec waiting for all job shells to be launched (start event) and for all job shells to complete (finish event).

Since the finish RPC takes a wait status parameter, this can be used to communicate whether the delegated job succeeded or failed. Once the finish event is posted, the job will proceed to CLEANUP (skipping per-rank epilog and housekeeping due to the alloc-bypass flag).
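
For reference, that status is a POSIX-style wait status: 0 means success, and a nonzero exit code is carried in the high byte. An illustrative encoding, with remote_exit_code as a placeholder for the delegated job's exit code:

  /* Encode the delegated job's exit code as a wait status for the
   * finish RPC (exit code occupies bits 8-15; 0 means success). */
  int status = remote_exit_code << 8;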

I think you would post the depend event once the job is scheduled remotely, then let the local priority plugin post the priority (which is just to advance the state since it won't be used in scheduling), then upon reaching SCHED state, immediately post the alloc event.

Agreed. Since your plugin will be posting the alloc event, the value of the job priority is inconsequential.

Now whether you hold the job in DEPEND state or SCHED state is up to you. It does seem like it could work either way if you always set the alloc-bypass flag before the depend event is posted.

