Bug description
When I load my checkpoint, Lightning prints LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0], so the GPU appears to be detected. However, the Slurm jobstats for the run show that the GPU was never actually used, even though memory was allocated on it:

GPU utilization per node
stellar-m01g3 (GPU 0): 0% <--- GPU was not used
GPU memory usage per node - maximum used/total
stellar-m01g3 (GPU 0): 12.2GB/40.0GB (30.5%)

I also made sure that accelerator: gpu is set in the trainer section of the YAML config file.
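To double-check whether the model actually ends up on the GPU, I can add a sanity check like the sketch below to the LightningModule (MySegModel is a placeholder name, not the real module from specseg; the hooks are standard Lightning API):

# Hypothetical sanity check: log where the module and the batches actually live.
# MySegModel is a placeholder for the LightningModule used in specseg.
import torch
import lightning.pytorch as pl


class MySegModel(pl.LightningModule):
    def on_fit_start(self) -> None:
        # self.device reflects the device the Trainer moved the module to.
        print(f"on_fit_start: module is on {self.device}")
        print(f"first parameter is on {next(self.parameters()).device}")

    def on_train_batch_start(self, batch, batch_idx) -> None:
        # Batches should already be on the GPU when accelerator: gpu is active.
        x = batch[0] if isinstance(batch, (tuple, list)) else batch
        if isinstance(x, torch.Tensor):
            print(f"batch {batch_idx} is on {x.device}")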
What version are you seeing the problem on?
v2.5
How to reproduce the bug
CLI setup in the training entry point (specseg.models.train):

cli = ModelCLI(
    subclass_mode_model=True,
    subclass_mode_data=True,
    parser_kwargs={"parser_mode": "omegaconf"},
    save_config_callback=None,
)
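For reference, ModelCLI itself is not shown above; I assume it is a thin subclass of LightningCLI, so a minimal stand-in with the required import would be:

# Minimal stand-in, assuming ModelCLI is a thin subclass of LightningCLI;
# the real class lives in specseg and is not included here.
from lightning.pytorch.cli import LightningCLI


class ModelCLI(LightningCLI):
    pass


if __name__ == "__main__":
    cli = ModelCLI(
        subclass_mode_model=True,
        subclass_mode_data=True,
        parser_kwargs={"parser_mode": "omegaconf"},
        save_config_callback=None,
    )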
Trainer section of config/model/config_label.yaml:

trainer:
  max_epochs: 10
  accelerator: gpu
  enable_progress_bar: False
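For what it's worth, that trainer section corresponds roughly to the following Trainer construction (a sketch only; devices is not set in the YAML, so it falls back to "auto"):

# Rough in-code equivalent of the trainer section above (sketch only).
# devices is omitted in the YAML, so it keeps the default "auto".
import lightning.pytorch as pl

trainer = pl.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices="auto",
    enable_progress_bar=False,
)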
Slurm submission script (excerpt):

#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

module purge
source ...
export SLURM_JOB_ID=$SLURM_JOB_ID

srun python -m specseg.models.train \
    fit \
    --config config/model/config_label.yaml \
    --ckpt_path /path

Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response
cc @lantiga