Closed
Labels: accelerator: cuda, bug, callback: model checkpoint, strategy: ddp, ver: 2.3.x
Description
Bug description
I’m using PyTorch Lightning DDP training with batch size = 16 on 8 GPUs per node × 2 nodes = 16 GPUs in total. However, I got the following error, which happens in the ModelCheckpoint callback. There seems to be a problem during synchronization between nodes when saving the model checkpoint. When I decreased the batch size to 4, the error disappeared. Can anyone help me?
My ModelCheckpoint configuration:
- type: ModelCheckpoint
every_n_train_steps: 2000
save_top_k: 30
monitor: "step"
filename: "checkpoint_{epoch}-{step}"
Stack:
[rank2]: Traceback (most recent call last):
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 130, in <module>
[rank2]: main()
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 121, in main
[rank2]: runner.train(resume_from=ckpt_path)
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/projects/e2e_aeb/flow/runner/xflow_runner.py", line 38, in train
[rank2]: self.trainer.fit(
[rank2]: File "/workspace/[email protected]/xpilot_vision/ai_foundation/xflow/xflow/lightning/trainer/xflow_trainer.py", line 356, in fit
[rank2]: super().fit(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
[rank2]: results = self._run_stage()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
[rank2]: self.fit_loop.run()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank2]: self.on_advance_end()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank2]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 210, in _call_callback_hooks
[rank2]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 323, in on_train_epoch_end
[rank2]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank2]: self._save_monitor_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 703, in _save_monitor_checkpoint
[rank2]: self._update_best_and_save(current, trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 732, in _update_best_and_save
[rank2]: filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 661, in _get_metric_interpolated_filepath_name
[rank2]: while self.file_exists(filepath, trainer) and filepath != del_filepath:
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 774, in file_exists
[rank2]: return trainer.strategy.broadcast(exists)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2636, in broadcast_object_list
[rank2]: object_tensor = torch.empty( # type: ignore[call-overload]
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
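The trace ends inside torch.distributed.broadcast_object_list, which ModelCheckpoint.file_exists uses to broadcast a Python bool (whether the checkpoint path already exists) from rank 0 to all ranks. broadcast_object_list first broadcasts the serialized object's size and then allocates a buffer of that size on the receiving ranks (the `object_tensor = torch.empty(...)` line in the trace); if the ranks fall out of sync on the collective, the size tensor received can be garbage, which is what produces the absurd "1EB" allocation. Below is a minimal, self-contained sketch of that broadcast pattern, not the Lightning internals themselves; it uses the gloo backend on CPU so it runs on a single machine, and the checkpoint path is hypothetical:

import os
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo keeps the example CPU-only; Lightning's DDP strategy uses nccl on GPUs.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    filepath = "checkpoint_epoch=0-step=2000.ckpt"  # hypothetical path
    exists = os.path.exists(filepath) if rank == 0 else None

    # Rank 0 pickles the bool, broadcasts its byte size, then the payload;
    # every other rank allocates a buffer of the announced size.
    obj = [exists]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {rank}: file exists -> {obj[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)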
What version are you seeing the problem on?
v2.3
How to reproduce the bug
Error messages and logs
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response