Commit 0ac7fcf
Fix ModelCheckpoint file_exists OOM in DDP
- Use strategy.reduce_boolean_decision instead of broadcast in ModelCheckpoint.file_exists
- Ensure only global rank 0 touches the filesystem
- Avoid broadcast_object_list for a simple boolean in DDP
- Add a small DDP test with monitor=None to exercise this path

1 parent 8f702b3 commit 0ac7fcf
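
The pattern described in the bullets above can be sketched in a single process, without Lightning or a real process group. This is an illustration under stated assumptions, not the code from this commit: `reduce_boolean_decision` here is a hypothetical stand-in that sums one 0/1 flag per rank the way an all-reduce would, its `all` parameter is assumed to mirror the strategy method named in the commit message, and `file_exists`, `counting_exists`, and `world_size=4` are made up for the demo.

```python
import os
import tempfile
from typing import Callable, List


def reduce_boolean_decision(decisions: List[bool], all: bool = True) -> bool:
    """Hypothetical stand-in: sum one 0/1 flag per rank, as an all-reduce would."""
    total = sum(int(d) for d in decisions)
    # all=True: every rank must agree; all=False: one True vote is enough.
    return total == len(decisions) if all else total > 0


def file_exists(filepath: str, world_size: int,
                exists: Callable[[str], bool] = os.path.exists) -> bool:
    """Simulate each rank's vote; only global rank 0 touches the filesystem."""
    votes = [exists(filepath) if rank == 0 else False for rank in range(world_size)]
    # Other ranks always vote False, so with all=False rank 0's answer decides.
    return reduce_boolean_decision(votes, all=False)


calls: List[str] = []


def counting_exists(path: str) -> bool:
    """Record every filesystem probe so the demo can count them."""
    calls.append(path)
    return os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "last.ckpt")
    open(ckpt, "wb").close()  # create an empty checkpoint file
    found = file_exists(ckpt, world_size=4, exists=counting_exists)
    missing = file_exists(os.path.join(tmp, "nope.ckpt"), world_size=4,
                          exists=counting_exists)

print(found, missing, len(calls))  # True False 2
```

Each simulated call probes the filesystem exactly once, on rank 0. The value shared across ranks is a single integer flag rather than a pickled Python object, which is the motivation the commit message gives for avoiding `broadcast_object_list`.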
2 files changed: +33 -2 lines

- src/lightning/pytorch/callbacks
- tests/tests_pytorch/checkpointing

src/lightning/pytorch/callbacks: 8 additions & 2 deletions. Original lines 1002-1003 are replaced by new lines 1002-1009; original lines 999-1001 and 1004-1006 (new 1010-1012) are unchanged context (diff content not captured).
tests/tests_pytorch/checkpointing: 25 additions & 0 deletions. New lines 124-148 are added after unchanged context lines 121-123 (diff content not captured).