flux assumes down nodes can come back, but this is only true for bootstrap from configuration #6641

@grondo

This issue relates to resilient batch jobs and was first mentioned in #6304. If a batch job is configured with -o exit-timeout=none, and non-critical ranks are lost to the point where there are not enough available nodes to run any pending job, the instance will hang until timeout. This is because Flux is currently designed to assume that down nodes may eventually return to service, but this is not supported except when bootstrap uses a config file, so it does not apply to jobs.

I'm not exactly sure how to address this, since the down node assumption is currently fundamental. Note that this cannot just be handled in a special submission feasibility plugin because jobs could already be pending when a node is lost.
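
To make the condition concrete, here is a minimal sketch in plain Python of the check that would need to be re-run each time a node is lost, rather than only once at submission. It does not use any Flux API: `PendingJob`, `unsatisfiable_jobs`, and the `down_can_return` flag are hypothetical names invented for illustration, and the model assumes a job's only requirement is a node count.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical stand-in for a pending job; not a Flux type."""
    jobid: int
    nnodes: int


def unsatisfiable_jobs(pending, up_nodes, down_nodes, down_can_return):
    """Return the pending jobs that can never be scheduled.

    up_nodes counts every node still online (idle or allocated), since
    allocated nodes are eventually released. Down nodes only count toward
    the usable total if they are assumed able to return to service.
    """
    max_nodes = up_nodes + (down_nodes if down_can_return else 0)
    return [job for job in pending if job.nnodes > max_nodes]


# A 4-node batch instance that has lost 2 non-critical ranks.
pending = [PendingJob(jobid=1, nnodes=2), PendingJob(jobid=2, nnodes=4)]

# Assumption described in this issue: down nodes may come back, so
# nothing is ever flagged and job 2 waits indefinitely.
stuck_if_nodes_return = unsatisfiable_jobs(pending, 2, 2, down_can_return=True)

# What holds for a batch job: down nodes are gone for good, so job 2
# can be identified as unsatisfiable.
stuck_if_nodes_gone = unsatisfiable_jobs(pending, 2, 2, down_can_return=False)

print([job.jobid for job in stuck_if_nodes_return])
print([job.jobid for job in stuck_if_nodes_gone])
```

Because the result depends on the current node counts, the same pending job can pass this check at submission and fail it later, which is why a submission-time feasibility plugin alone would not catch the hang.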
