Detailed Description
Sometimes srun loses the connection to the job it started. In this case, it would be nice if cluster-tools could try to reconnect to the running job (e.g., using sattach).
Use Cases & Context
srun sometimes returns the following error:
srun: error: Unable to confirm allocation for job <id>: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job <id>
This is erroneously reported as a failed job.
Instead, cluster-tools could try to reattach to the running job (e.g., with sattach).
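
A rough sketch of what such a recovery step might look like, not a proposal for the actual cluster-tools implementation. All function names here are hypothetical. It assumes the job id can be parsed from the srun message quoted above, that `squeue -h -j <id> -o %T` reports the job state, and that the step to attach to is step 0, which may not hold for every setup.

```python
import re
import subprocess
from typing import List, Optional

# Pattern taken from the srun message quoted above; the job id is captured.
_LOST_ALLOCATION = re.compile(
    r"srun: error: Unable to confirm allocation for job (\d+): "
    r"Socket timed out on send/recv operation"
)


def lost_job_id(srun_stderr: str) -> Optional[str]:
    """Return the job id if stderr shows the socket-timeout error, else None."""
    match = _LOST_ALLOCATION.search(srun_stderr)
    return match.group(1) if match else None


def job_state(job_id: str) -> Optional[str]:
    """Ask squeue for the job state; None if squeue fails or lists nothing."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True,
        text=True,
    )
    state = result.stdout.strip()
    return state if result.returncode == 0 and state else None


def sattach_command(job_id: str, step_id: int = 0) -> List[str]:
    """Build the sattach call; step 0 is an assumption, not a given."""
    return ["sattach", f"{job_id}.{step_id}"]


def reattach_if_running(srun_stderr: str) -> Optional[int]:
    """Hypothetical hook: reattach instead of reporting a failed job."""
    job_id = lost_job_id(srun_stderr)
    if job_id is None or job_state(job_id) != "RUNNING":
        return None  # a different error, or the job is really gone
    return subprocess.run(sattach_command(job_id)).returncode
```

A caller would fall back to the current "failed job" handling whenever `reattach_if_running` returns `None`, so only the socket-timeout case with a job that is still running takes the new path.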