flux assumes down nodes can come back, but this is only true for bootstrap from configuration

This is an issue related to resilient batch jobs, first mentioned in #6304. If a batch job is configured with `-o exit-timeout=none`, and non-critical ranks are lost to the point where there are not enough available nodes to run any pending job, the instance will hang until timeout. This is because Flux is currently designed to assume that down nodes may eventually return to service, but this is not supported except when bootstrap uses a config file, so it does not apply to batch jobs.

I'm not exactly sure how to address this, since the down node assumption is currently fundamental. Note that this cannot just be handled in a special submission feasibility plugin because jobs could already be pending when a node is lost.
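
To make the pending-job case concrete, here is a toy model in plain Python. It is only a sketch: `ToyInstance`, `nodes_can_return`, `submit`, and `node_lost` are hypothetical names for illustration, not Flux APIs, and what should actually happen to an unsatisfiable job is left open.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInstance:
    """Toy model of an instance (hypothetical, not the Flux API)."""
    total_nodes: int
    nodes_can_return: bool  # assumed True only for bootstrap from a config file
    down: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # job id -> nodes requested

    def capacity(self) -> int:
        # Down nodes count toward capacity only if they may come back.
        if self.nodes_can_return:
            return self.total_nodes
        return self.total_nodes - len(self.down)

    def submit(self, jobid: str, nnodes: int) -> bool:
        # Submission-time feasibility check.
        if nnodes > self.capacity():
            return False
        self.pending[jobid] = nnodes
        return True

    def node_lost(self, rank: int) -> list:
        # Re-check jobs that are already pending; return those that can
        # never run with the nodes that remain.
        self.down.add(rank)
        cap = self.capacity()
        doomed = [j for j, n in self.pending.items() if n > cap]
        for j in doomed:
            del self.pending[j]
        return doomed


inst = ToyInstance(total_nodes=4, nodes_can_return=False)
inst.submit("a", 4)       # feasible at submission time
inst.submit("b", 2)
print(inst.node_lost(3))  # ['a'] can no longer be satisfied
```

Job `a` passes the submission-time check and only becomes unsatisfiable later, which is why the re-check has to happen when a node is lost rather than only at submission.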


flux assumes down nodes can come back, but this is only true for bootstrap from configuration #6641
