Bug description
When running a model fit on a SLURM cluster, everything works correctly, but when the time limit is reached I receive:
Epoch 10: 93%|█████████▎| 18625/20000 [5:43:21<25:20, 0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1
Epoch 10: 93%|█████████▎| 18626/20000 [5:43:21<25:19, 0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1.490, gen_emb_independence
slurmstepd: error: *** STEP 55595933.0 ON maestro-3017 CANCELLED AT 2025-01-10T15:27:25 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15
Epoch 10: 93%|█████████▎| 18626/20000 [5:43:21<25:19, 0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10: 93%|█████████▎| 18627/20000 [5:43:23<25:18, 0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10: 93%|█████████▎| 18627/20000 [5:43:23<25:18, 0.90it/s, v_num=0txx, train_loss=3.330, denoise_60%_expr=1.270, denoise_60%_emb_independence=0.0689, denoise_60%_cls=0.368, denoise_60%_ecs=0.867, gen_expr=1.460, gen_emb_independence=0.0584, gen_ecs=0.868, cce=0.480]
wandb: 🚀 View run super-dream-58 at: https://wandb.ai/ml4ig/scprint_v2/runs/k2oz0txx
wandb: Find logs at: ../../../../zeus/projets/p02/ml4ig_hot/Users/jkalfon/wandb/run-20250107_152923-k2oz0txx/logs
Unfortunately, the model never requeues and doesn't even save a checkpoint...
It seems I shouldn't have to add anything to my config.yml for this, but even when adding
plugins:
  - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
    init_args:
      requeue_signal: SIGHUP
it doesn't change anything. I have also specified --signal=SIGUSR1@90 in my sbatch command.
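For reference, my understanding of the plain-Python equivalent of that plugin config (a minimal sketch, not my actual training script, assuming Lightning 2.x; the signal here is chosen to match the --signal I pass to sbatch rather than the SIGHUP I tried above):

import signal

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Register the SLURM environment explicitly so Lightning knows which
# signal SLURM will send before the time limit; SIGUSR1 matches
# `--signal=SIGUSR1@90` in the sbatch command.
trainer = Trainer(
    plugins=[SLURMEnvironment(auto_requeue=True, requeue_signal=signal.SIGUSR1)],
)
# trainer.fit(model, datamodule=dm)  # model/dm come from the scPRINT configs

My expectation was that with this in place the job would save a checkpoint and requeue itself on the signal, but that is not what I observe.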
Is there a solution?
What version are you seeing the problem on?
v2.4
How to reproduce the bug
git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml
cc @lantiga