RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly #18

@pascual-tejero

Description

Hello,

I get an error when trying to run the vesselformer code (train.py) on a GPU cluster at my university. I tried reducing the batch size (from 50 to 2), but it still fails to train. The full output of the run is below:

2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: - Run '<function main at 0x7f2cd2855670>' in 1 processes
2023-08-28 12:02:44,379 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
        {'batch_size': 50, 'shuffle': True, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2023-08-28 12:02:44,397 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
        {'batch_size': 50, 'shuffle': False, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
Current run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
Engine run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
2023-08-28 12:03:30,077 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 205, in <module>
    parallel.run(main, args)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/distributed/launcher.py", line 316, in run
    func(local_rank, *args, **kwargs)
  File "train.py", line 196, in main
    trainer.run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/trainer.py", line 56, in run
    super().run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/workflow.py", line 250, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 704, in run
    return self._internal_run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 783, in _internal_run
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
slurmstepd: error: Detected 9 oom-kill event(s) in StepId=20449.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
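
For reference, here is a minimal sketch of the DataLoader settings involved. I am using a plain torch.utils.data.DataLoader for illustration; in the repository the loader is actually built through ignite's auto_dataloader, and train_dataset / image_graph_collate below are stand-ins for the real objects from dataset_vessel3d. The batch_size of 2 is the reduced value I mentioned above; the lower num_workers is only the change suggested by the UserWarning in the log, and I have not confirmed whether it helps.

from torch.utils.data import DataLoader

# Sketch only: train_dataset and image_graph_collate are placeholders for the
# objects that train.py actually passes to ignite's auto_dataloader.
train_loader = DataLoader(
    train_dataset,
    batch_size=2,       # reduced from 50, as described above
    shuffle=True,
    num_workers=4,      # the UserWarning suggests at most 8 workers on this node
    collate_fn=image_graph_collate,
    pin_memory=True,
)

Given the slurmstepd line at the end of the log, my guess is that the worker processes are exceeding the host-memory limit of the job's cgroup rather than running out of GPU memory, which is why I have been experimenting with the batch size.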

Any help you can provide would be very welcome.
