
SLURM env incorrectly complains about srun not being used in a salloc interactive session #20776

@profPlum


Bug description

I get an error message (see below) complaining that I'm not using srun, which prevents me from running my parallel code. In reality I did use srun, but I'm also inside an interactive allocation created with salloc. To get past this error I had to edit /u/ddeighan/miniforge3/envs/uqops/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py and manually disable the check for whether the allocation is interactive.
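A less invasive workaround might be to hand the Trainer the cluster environment explicitly so that auto-detection (and its interactive-mode heuristic) doesn't have to run. I haven't verified that this actually bypasses the check, so treat the snippet below as a sketch; the devices/num_nodes values are just chosen to match the `salloc -N2 --gpus=2` allocation from the reproduction steps below:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Sketch of a possible workaround: pass the SLURM cluster environment explicitly
# instead of letting Lightning auto-detect it. devices/num_nodes match the
# `salloc -N2 --gpus=2` + `srun python train_parallel.py` example below.
trainer = Trainer(
    accelerator="gpu",
    devices=1,          # one GPU per node in this allocation
    num_nodes=2,
    strategy="ddp",
    plugins=[SLURMEnvironment(auto_requeue=False)],
)
```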

There needs to be better sanity-checking code that can tell the difference between `srun -n1 --pty bash` followed by a bare `python train.py` (invalid: the training script itself is not launched with srun) and `salloc ...; srun python train.py` (valid: srun is used properly).
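For illustration, here is a rough sketch (plain Python, not Lightning's actual internals) of the kind of distinction I mean. It assumes the usual SLURM behaviour that srun exports per-task variables such as SLURM_PROCID and SLURM_STEP_ID to the processes it launches, so those can be combined with the job-name heuristic instead of the job name deciding on its own:

```python
import os

def launched_by_srun() -> bool:
    # srun exports per-task variables to each task it starts; a plain
    # `salloc` shell does not set SLURM_PROCID by itself.
    return "SLURM_PROCID" in os.environ and "SLURM_STEP_ID" in os.environ

def single_task_interactive_shell() -> bool:
    # Heuristic for `srun -n1 --pty bash` followed by a bare `python train.py`:
    # a single-task step whose job name is the interactive default.
    ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
    job_name = os.environ.get("SLURM_JOB_NAME", "")
    return ntasks <= 1 and job_name in ("bash", "interactive")

def srun_properly_used() -> bool:
    # `salloc ...; srun python train.py` passes this even when the
    # allocation's job name defaults to "interactive".
    return launched_by_srun() and not single_task_interactive_shell()
```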

What version are you seeing the problem on?

v2.5

How to reproduce the bug

salloc -N2 --gpus=2 ...
srun python train_parallel.py
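To see which of these variables a given cluster actually sets, a quick diagnostic (a hypothetical check_env.py, not part of my training code) can be launched the same way:

```python
import os

# Run as `srun python check_env.py` inside the salloc session: if srun really
# launched this process, the per-task variables should be populated.
for key in ("SLURM_JOB_NAME", "SLURM_NTASKS", "SLURM_PROCID", "SLURM_STEP_ID"):
    print(f"{key}={os.environ.get(key, '<unset>')}")
```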

Error messages and logs

/u/ddeighan/miniforge3/envs/uqops/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python channel.py ...

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @lantiga
