Unable to requeue a job after SIGTERM signal on SLURM #20542

@jkobject

Bug description

When running a model fit on a SLURM cluster, everything works correctly, but when the job's time limit is reached I receive:

Epoch 10:  93%|█████████▎| 18625/20000 [5:43:21<25:20,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1
Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1.490, gen_emb_independence

slurmstepd: error: *** STEP 55595933.0 ON maestro-3017 CANCELLED AT 2025-01-10T15:27:25 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15

Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.330, denoise_60%_expr=1.270, denoise_60%_emb_independence=0.0689, denoise_60%_cls=0.368, denoise_60%_ecs=0.867, gen_expr=1.460, gen_emb_independence=0.0584, gen_ecs=0.868, cce=0.480]

wandb: 🚀 View run super-dream-58 at: https://wandb.ai/ml4ig/scprint_v2/runs/k2oz0txx
wandb: Find logs at: ../../../../zeus/projets/p02/ml4ig_hot/Users/jkalfon/wandb/run-20250107_152923-k2oz0txx/logs

Unfortunately, the job never requeues and doesn't even save a checkpoint.
It seems I shouldn't have to add anything to my config.yml for this, but even when adding

plugins:
    - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
      init_args:
        requeue_signal: SIGHUP

it doesn't change anything. I have also specified --signal=SIGUSR1@90 in my sbatch command.
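
For reference, this is the Python-side equivalent of that plugin configuration as I understand it (a minimal sketch, assuming the requeue_signal passed to the plugin has to match the signal sbatch actually delivers, here SIGUSR1):

import signal

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Sketch of the plugin configured in code instead of config.yml.
# The requeue signal must match the one sent by sbatch (--signal=SIGUSR1@90).
trainer = Trainer(
    plugins=[
        SLURMEnvironment(
            auto_requeue=True,              # re-submit the job via scontrol requeue
            requeue_signal=signal.SIGUSR1,  # the signal Lightning listens for
        )
    ]
)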
Is there a solution?
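
In the meantime, a manual fallback I could try is registering my own handler (a hypothetical sketch; install_preempt_handler and the ckpt_path default are my own placeholders, not Lightning API):

import os
import signal
import subprocess

from lightning.pytorch import Trainer


def install_preempt_handler(trainer: Trainer, ckpt_path: str = "last.ckpt") -> None:
    # Hypothetical fallback: checkpoint and requeue by hand if Lightning's
    # built-in handling never fires.
    def _handler(signum, frame):
        trainer.save_checkpoint(ckpt_path)  # persist state before the job is killed
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id:
            # ask SLURM to put this job back into the queue
            subprocess.run(["scontrol", "requeue", job_id], check=False)

    signal.signal(signal.SIGUSR1, _handler)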

What version are you seeing the problem on?

v2.4

How to reproduce the bug

git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

cc @lantiga
