Reattach on connection timeouts #1360

@robert-oleynik

Detailed Description

Sometimes srun loses the connection to the job it started. In that case, it would be nice if cluster-tools tried to reconnect to the still-running job (e.g., using sattach).

Use Cases & Context

The following error is sometimes returned by srun.

srun: error: Unable to confirm allocation for job <id>: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job <id>

This is erroneously reported as a failed job.
Instead, cluster-tools could try to reconnect to the running job (e.g., with sattach).
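
To make the request concrete, here is a rough sketch of what such a reconnect check might look like. It is not taken from the cluster-tools code base: all function names are hypothetical, the regular expression is built only from the error text quoted above, and the step id `0` passed to sattach is an assumption that may not hold for every job. It could also be that no job step exists to attach to after this error, so the sketch only builds the sattach command when squeue reports the job as RUNNING.

```python
import re
import subprocess
from typing import Callable, List, Optional

# Pattern derived from the srun error text quoted in this issue.
TIMEOUT_PATTERN = re.compile(
    r"srun: error: Unable to confirm allocation for job (\d+): "
    r"Socket timed out on send/recv operation"
)


def find_timed_out_job_id(stderr: str) -> Optional[str]:
    """Return the job id if stderr contains the timeout error, else None."""
    match = TIMEOUT_PATTERN.search(stderr)
    return match.group(1) if match else None


def job_is_running(job_id: str, run: Callable = subprocess.run) -> bool:
    """Ask squeue for the job state; `run` is injectable for testing."""
    result = run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "RUNNING"


def reattach_command(job_id: str, step_id: int = 0) -> List[str]:
    """Build the sattach call. Step id 0 is an assumption, not a given."""
    return ["sattach", f"{job_id}.{step_id}"]


def handle_srun_failure(
    stderr: str, run: Callable = subprocess.run
) -> Optional[List[str]]:
    """Return a sattach command if the failure looks like a lost connection
    to a job that is still running; otherwise None (treat as a real failure)."""
    job_id = find_timed_out_job_id(stderr)
    if job_id is None or not job_is_running(job_id, run):
        return None
    return reattach_command(job_id)
```

A caller would pass the captured stderr of the failed srun call to `handle_srun_failure` and, if a command comes back, run it instead of marking the job as failed. Retries and a limit on reconnect attempts are left out of the sketch.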
