Closed
Labels: accelerator: cuda, bug, callback: model checkpoint, strategy: ddp, ver: 2.3.x
Description
Bug description
I’m using PyTorch Lightning DDP training with batch size = 16 on 8 GPUs per node × 2 nodes = 16 GPUs in total. However, I got the following error, which happens in the ModelCheckpoint callback. There seems to be a problem during synchronization between nodes when saving the model checkpoint. When I decreased the batch size to 4, the error disappeared. Can anyone help me?
My ModelCheckpoint configuration:
- type: ModelCheckpoint
every_n_train_steps: 2000
save_top_k: 30
monitor: "step"
filename: "checkpoint_{epoch}-{step}"
Stack:
[rank2]: Traceback (most recent call last):
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 130, in <module>
[rank2]: main()
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 121, in main
[rank2]: runner.train(resume_from=ckpt_path)
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/flow/runner/xflow_runner.py", line 38, in train
[rank2]: self.trainer.fit(
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/xflow/xflow/lightning/trainer/xflow_trainer.py", line 356, in fit
[rank2]: super().fit(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
[rank2]: results = self._run_stage()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
[rank2]: self.fit_loop.run()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank2]: self.on_advance_end()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank2]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 210, in _call_callback_hooks
[rank2]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 323, in on_train_epoch_end
[rank2]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank2]: self._save_monitor_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 703, in _save_monitor_checkpoint
[rank2]: self._update_best_and_save(current, trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 732, in _update_best_and_save
[rank2]: filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 661, in _get_metric_interpolated_filepath_name
[rank2]: while self.file_exists(filepath, trainer) and filepath != del_filepath:
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 774, in file_exists
[rank2]: return trainer.strategy.broadcast(exists)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2636, in broadcast_object_list
[rank2]: object_tensor = torch.empty( # type: ignore[call-overload]
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
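The trace ends inside torch.distributed.broadcast_object_list, which ModelCheckpoint.file_exists uses to broadcast a Python bool (whether the checkpoint path already exists) from rank 0 to all ranks. broadcast_object_list first broadcasts the serialized object's size and then allocates a buffer of that size on the receiving ranks (the `object_tensor = torch.empty(...)` line in the trace); if the ranks fall out of sync on the collective, the size tensor received can be garbage, which is what produces the absurd "1EB" allocation. Below is a minimal, self-contained sketch of that broadcast pattern, not the Lightning internals themselves; it uses the gloo backend on CPU so it runs on a single machine, and the checkpoint path is hypothetical:

import os
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo keeps the example CPU-only; Lightning's DDP strategy uses nccl on GPUs.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    filepath = "checkpoint_epoch=0-step=2000.ckpt"  # hypothetical path
    exists = os.path.exists(filepath) if rank == 0 else None

    # Rank 0 pickles the bool, broadcasts its byte size, then the payload;
    # every other rank allocates a buffer of the announced size.
    obj = [exists]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {rank}: file exists -> {obj[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)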
What version are you seeing the problem on?
v2.3
How to reproduce the bug
Error messages and logs
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response