PyTorch Training Error on Multi-GPU Setup with SLURM: 'No Space Left on Device' Despite Ample Space #19565
Unanswered
eyad-al-shami asked this question in DDP / multi-GPU / multi-node
I'm currently running multiple experiments across 4 GPUs on a single node managed by Slurm. Each GPU executes a distinct experiment using the same model but with varying hyperparameters, all working on the same dataset. My setup script for each experiment looks something like this:
experiment_1.sbatch:
```bash
#!/bin/bash
#SBATCH -p normal
#SBATCH -o logs/baselines/%j.out
#SBATCH -t 7:00:00
#SBATCH --gres=gpu:full:4
#SBATCH -c 40

CUDA_VISIBLE_DEVICES=0 python3 train.py --parameters 1 &
CUDA_VISIBLE_DEVICES=1 python3 train.py --parameters 2 &
CUDA_VISIBLE_DEVICES=2 python3 train.py --parameters 3 &
CUDA_VISIBLE_DEVICES=3 python3 train.py --parameters 4 &
wait
```
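Each `train.py` is a fairly standard PyTorch Lightning script. For context, the sketch below shows the rough shape of what each process runs; the model, datamodule, and `num_workers` value here are placeholders, not my actual code:

```python
# Rough shape of each train.py process (placeholder classes and values; the
# real model, datamodule, and hyperparameters differ). Every process builds
# its own Trainer and multi-worker DataLoaders on the same node.
import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset


class MyDataModule(L.LightningDataModule):
    def train_dataloader(self):
        # num_workers > 0 means each loader spawns worker processes
        return DataLoader(TensorDataset(torch.randn(256, 8)), batch_size=32, num_workers=4)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=32, num_workers=4)


class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()

    def validation_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    model = MyModel()
    my_dataset_module = MyDataModule()
    trainer = L.Trainer(max_epochs=1, accelerator="auto", devices=1)
    # Fails during the validation sanity check, when the DataLoader
    # starts its worker processes (see the traceback below).
    trainer.fit(model, datamodule=my_dataset_module)
```

So on any given node there are four such processes running side by side, each spawning its own set of DataLoader worker processes.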
This approach had been working flawlessly until recently, when some of my jobs started failing. Interestingly, when a job fails, all four commands within it fail simultaneously with the same error during Lightning's sanity check:
```
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/train.py", line 162, in <module>
    trainer.fit(model, datamodule=my_dataset_module)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 98, in run
    self.setup_data()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 168, in setup_data
    _check_dataloader_iterable(dl, source, trainer_fn)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 391, in _check_dataloader_iterable
    iter(dataloader) # type: ignore[call-overload]
    ^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 436, in __iter__
    self._iterator = self._get_iterator()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1015, in __init__
    self._worker_result_queue = multiprocessing_context.Queue() # type: ignore[var-annotated]
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/context.py", line 103, in Queue
    return Queue(maxsize, ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
    self._rlock = ctx.Lock()
                  ^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/context.py", line 68, in Lock
    return Lock(ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/synchronize.py", line 167, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device
```
However, I'm confident the issue isn't related to actual disk space, as I have ample free space, and not every job I submit fails with this error.
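From the traceback, the error is raised while the DataLoader is creating its worker queues, i.e. when `multiprocessing` allocates a semaphore via `_multiprocessing.SemLock`. My guess (which may be wrong) is that Errno 28 at that point refers to the node's shared-memory filesystem (`/dev/shm`) or its IPC limits rather than the disk I checked. Here is a minimal sketch of what I could run on the node to check this, assuming `/dev/shm` is the relevant mount:

```python
# Minimal check, assuming the relevant resource is the node's shared-memory
# filesystem (/dev/shm) rather than the disk the job writes its outputs to.
import multiprocessing as mp
import shutil

# Report /dev/shm usage; with four training processes (each with several
# DataLoader workers) sharing one node, this could fill up even though the
# home/scratch filesystems show plenty of free space.
usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {usage.used / 2**20:.0f} MiB used, {usage.free / 2**20:.0f} MiB free")

# The same primitive the DataLoader creates internally (a Queue needs a
# SemLock); if shared memory / semaphore resources are exhausted, this line
# raises the same OSError: [Errno 28] as in the traceback above.
q = mp.get_context("fork").Queue()
q.close()
```

Running something like this alongside the four training processes would at least show whether shared memory on the node is being exhausted as the jobs start.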
I would greatly appreciate any insights or suggestions on how to resolve this issue.