
Loading checkpoint from CLI using SLURM doesn't use GPU even though it says it does #20689

@nathanchenseanwalter

Description


Bug description

When I load my checkpoint, the log reports LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

But when I check my SLURM jobstats, they show:

GPU utilization per node
stellar-m01g3 (GPU 0): 0% <--- GPU was not used

GPU memory usage per node - maximum used/total
stellar-m01g3 (GPU 0): 12.2GB/40.0GB (30.5%)

I even made sure to explicitly set accelerator: gpu in the trainer section of the YAML file.
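
For reference, one way to confirm at runtime where the model actually ends up is a small callback that prints the devices once fitting starts. This is only a debugging sketch; the DevicePrinter class below is illustrative and not part of my code.

import torch
from lightning.pytorch.callbacks import Callback

class DevicePrinter(Callback):
    # Illustrative debugging callback: reports whether CUDA is visible
    # and which device the restored model parameters ended up on.
    def on_fit_start(self, trainer, pl_module):
        print("CUDA available:", torch.cuda.is_available())
        print("Trainer root device:", trainer.strategy.root_device)
        print("Model parameter device:", next(pl_module.parameters()).device)

Adding this under trainer.callbacks in the YAML (via its class_path) would show whether the checkpointed weights were moved to cuda:0 or stayed on the CPU.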

What version are you seeing the problem on?

v2.5

How to reproduce the bug

cli = ModelCLI(
    subclass_mode_model=True,
    subclass_mode_data=True,
    parser_kwargs={"parser_mode": "omegaconf"},
    save_config_callback=None,
)
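
For context, a minimal self-contained version of that entrypoint, assuming ModelCLI is a thin subclass of lightning.pytorch.cli.LightningCLI (the class name and CLI arguments come from my code above; everything else here is illustrative):

from lightning.pytorch.cli import LightningCLI

class ModelCLI(LightningCLI):
    # Assumed to be a plain LightningCLI subclass; any real overrides are not shown here.
    pass

def main():
    # subclass_mode_* lets the YAML choose the LightningModule/DataModule classes;
    # parser_mode="omegaconf" enables OmegaConf-style interpolation in the config.
    ModelCLI(
        subclass_mode_model=True,
        subclass_mode_data=True,
        parser_kwargs={"parser_mode": "omegaconf"},
        save_config_callback=None,
    )

if __name__ == "__main__":
    main()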


trainer:
  max_epochs: 10
  accelerator: gpu
  enable_progress_bar: False
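
For comparison, the same trainer block with the device count pinned explicitly. devices: 1 mirrors the single GPU requested via --gres=gpu:1 below; it is an addition for illustration, not something from my original config:

trainer:
  max_epochs: 10
  accelerator: gpu
  devices: 1
  enable_progress_bar: False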

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

module purge
source ...

export SLURM_JOB_ID=$SLURM_JOB_ID

srun python -m specseg.models.train \
    fit \
    --config config/model/config_label.yaml \
    --ckpt_path /path
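
As a point of comparison, a fuller submission script in which the SLURM task count explicitly matches the single requested GPU. This is a sketch: the --ntasks-per-node and --cpus-per-task values are assumptions, and the activation line stands in for the elided source ... above.

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1   # one srun task per GPU, matching what Lightning's SLURM integration expects
#SBATCH --cpus-per-task=4     # illustrative value
#SBATCH --time=04:00:00

module purge
source ...                    # environment activation, as in the original script

srun python -m specseg.models.train \
    fit \
    --config config/model/config_label.yaml \
    --ckpt_path /path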

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @lantiga
