
KeyError on Single GPU Setup and Clarification on Rank Environment Variable #31

@mariem-m11

Description


I am attempting to train on a single GPU. I modified the config file to use nnUNetTrainerV2 instead of nnUNetTrainerV2_DDP.

I also modified train.sh as follows:
nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 torchrun ./train.py --task="Task180_BraTSMet" --fold=${fold} --config=$CONFIG --network="3d_fullres" --resume='' --local-rank=0 --optim_name="adam" --valbest --val_final --npz
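Part of my confusion (the second point in the title) is how the rank is supposed to be picked up in a single-GPU run. As far as I understand, torchrun passes rank information to each worker through environment variables; the small check below (plain Python, nothing specific to this repo, and the script name is just my own) prints what a single-process launch provides:

```python
# check_rank.py: print the rank-related environment variables set by torchrun.
# Run with: torchrun --standalone --nproc_per_node=1 check_rank.py
import os

for var in ("LOCAL_RANK", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```

With --nproc_per_node=1 I would expect LOCAL_RANK=0, RANK=0 and WORLD_SIZE=1, which is why I also pass --local-rank=0 to train.py. Please correct me if the trainer expects the rank to be set differently.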

The script throws a KeyError when accessing the patch_size via plan_data['plans']['plans_per_stage'][resolution_index]. The traceback below shows KeyError: 1, while the log says stage 0 is being used, so resolution_index appears to be 1 even though the 3d_fullres plans seem to contain only a single stage, index 0 (a quick way to check this is included after the error message).

Error Message:

Using configuration: configs/Brats/decoder_only.yaml
Running on fold: 0

['/kaggle/working/nnUNet/nnunet/nnUNet_raw_data_base/nnUNet_raw_data/training/network_training'] nnUNetTrainerV2 nnunet.training.network_training

I am running the following nnUNet: 3d_fullres
My trainer class is:  <class 'nn_transunet.trainer.nnUNetTrainerV2.nnUNetTrainerV2'>
For that I will be using the following configuration:
I am using stage 0 from these plans
I am using sample dice + CE loss

I am using data from this folder: /kaggle/working/nnUNet/nnunet/preprocessed/Task180_BraTSMet/nnUNetData_plans_v2.1

Traceback (most recent call last):
  File "/kaggle/working/3D-TransUNet/./train.py", line 321, in <module>
    main()
  File "/kaggle/working/3D-TransUNet/./train.py", line 202, in main
    patch_size = plan_data['plans']['plans_per_stage'][resolution_index]['patch_size']
KeyError: 1
[2024-08-13 16:59:32,745] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1841) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(this._config, this._entrypoint, list(args))
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-13_16:59:32
  host      : ec35b85052cc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1841)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
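To see what the plans actually contain, the quick check below lists the available stage indices. The plans filename is a guess based on nnUNet v2.1 naming and the preprocessed folder shown in the log, and train.py indexes an extra ['plans'] level, so it may be loading a different pickle than the one I open here.

```python
# Inspect the nnUNet plans pickle to see which stage indices exist.
# NOTE: the filename is an assumption (nnUNet v2.1 default naming); point it
# at whatever plans file train.py actually loads into plan_data.
import pickle

plans_path = ("/kaggle/working/nnUNet/nnunet/preprocessed/"
              "Task180_BraTSMet/nnUNetPlansv2.1_plans_3D.pkl")

with open(plans_path, "rb") as f:
    plans = pickle.load(f)

stages = plans["plans_per_stage"]  # dict keyed by stage index (0, 1, ...)
print("available stage indices:", list(stages.keys()))
print("stage 0 patch_size:", stages[0]["patch_size"])
```

My assumption is that only index 0 exists for 3d_fullres on this dataset, which would explain the KeyError: 1, but please correct me if resolution_index is supposed to come from somewhere else.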
