Bug description
I get an error message (see below) complaining that I'm not using srun, which prevents me from running my parallel code. In reality I did use srun, but I'm also inside an interactive allocation created with salloc. To get past the error I had to edit /u/ddeighan/miniforge3/envs/uqops/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py
and manually disable the check for whether the allocation is interactive.
The sanity check needs to be smart enough to tell the difference between
srun -n1 --pty bash (invalid: srun launched the shell, not the training script)
and
salloc ...; srun python train.py (valid: srun launched the training script).
One possible heuristic is sketched below.
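A minimal sketch of such a heuristic, using only documented SLURM environment variables. This is not Lightning's actual detection code; the choice of SLURM_PROCID as the marker for srun-launched tasks and SLURM_PTY_PORT as the marker for --pty shells are assumptions that would need to be verified against the target Slurm version:

```python
# Sketch of a possible heuristic -- NOT Lightning's actual code.
# Assumes standard Slurm behaviour: srun exports per-task variables such as
# SLURM_PROCID to every process it launches, and `srun --pty bash` additionally
# exports SLURM_PTY_PORT into the interactive shell (and everything typed in it).
import os

def _launched_by_srun() -> bool:
    # A plain `salloc` shell has SLURM_JOB_ID but no per-task variables;
    # processes started via `srun` get SLURM_PROCID (their task rank).
    return "SLURM_PROCID" in os.environ

def _inside_pty_shell() -> bool:
    # Commands typed inside `srun -n1 --pty bash` inherit the step's environment,
    # so they look srun-launched; SLURM_PTY_PORT flags that pty session.
    return "SLURM_PTY_PORT" in os.environ

def srun_used_properly() -> bool:
    # valid:   salloc ...; srun python train.py  -> task vars set, no pty marker
    # invalid: srun -n1 --pty bash               -> task vars set, pty marker set
    return _launched_by_srun() and not _inside_pty_shell()
```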
What version are you seeing the problem on?
v2.5
How to reproduce the bug
salloc -N2 --gpus=2 ...
srun python train_parallel.py
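To see what the detection logic has to work with, here is a small diagnostic (hypothetical, not part of the repro) that can be run both directly in the salloc shell and under srun; the variable list is my assumption about what matters, not taken from Lightning:

```python
# diagnose_slurm_env.py -- hypothetical diagnostic script.
# Prints the SLURM variables that plausibly distinguish the two launch modes.
import os

for var in ("SLURM_JOB_ID", "SLURM_JOB_NAME", "SLURM_NTASKS",
            "SLURM_STEP_ID", "SLURM_PROCID", "SLURM_PTY_PORT"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```

Running it as srun python diagnose_slurm_env.py inside the allocation should show the per-task variables set; running it directly in the salloc shell should not.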
Error messages and logs
/u/ddeighan/miniforge3/envs/uqops/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python channel.py ...
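For anyone hitting the same wall before a fix lands, a less invasive workaround than editing site-packages is to no-op the check at import time. This is a sketch that assumes the warning comes from SLURMEnvironment._validate_srun_used; that name matches the file and message above but may differ across versions, so verify it against your installed slurm.py first:

```python
# Hypothetical workaround sketch: silence the srun sanity check without editing
# site-packages. ASSUMPTION: the check is SLURMEnvironment._validate_srun_used;
# confirm the symbol name in your installed slurm.py before relying on this.
from lightning_fabric.plugins.environments import SLURMEnvironment

SLURMEnvironment._validate_srun_used = staticmethod(lambda: None)  # disable check
```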
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response
cc @lantiga