Closed
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x
Description
Bug description
I upgraded my pl (PyTorch Lightning) version from 2.0.4 to 2.4. The same training code now fails with the error below:
2024-09-06 23:01:53 [on_validation_epoch_end] collecting matched instance from 64 ranks
2024-09-06 23:02:17 [on_validation_epoch_end] end
Traceback (most recent call last):
  File "/usr/lib/python3.10/shutil.py", line 816, in move
    os.rename(src, real_dst)
OSError: [Errno 18] Invalid cross-device link: '/tmp/tmp7ue72_bg' -> '/xxx/model/ckpt-epoch=000.ckpt'
Comparing the corresponding code between the two versions, I found the following difference:
@@ -65,8 +78,69 @@ def _atomic_save(checkpoint: Dict[str, Any], filepath: Union[str, Path]) -> None
             accepts.
         filepath: The path to which the checkpoint will be saved.
             This points to the file that the checkpoint will be stored in.
+
     """
     bytesbuffer = io.BytesIO()
+    log.debug(f"Saving checkpoint: {filepath}")
     torch.save(checkpoint, bytesbuffer)
-    with fsspec.open(filepath, "wb") as f:
+
+    # We use a transaction here to avoid file corruption if the save gets interrupted
+    fs, urlpath = fsspec.core.url_to_fs(str(filepath))
+    with fs.transaction, fs.open(urlpath, "wb") as f:
         f.write(bytesbuffer.getvalue())
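It seems that using a transaction causes the error when the file has to be moved between different filesystems: the traceback shows a temporary file under /tmp being renamed to a checkpoint path on another device. Is there any solution?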
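One possible direction, as a minimal sketch rather than Lightning's or fsspec's actual code: create the temporary file in the same directory as the destination, so the final rename stays on one filesystem. The helper name `save_bytes_atomically` is hypothetical, and the sketch assumes a local destination path, not a remote URL.

```python
import os
import tempfile
from pathlib import Path


def save_bytes_atomically(data: bytes, filepath) -> None:
    """Hypothetical helper: write `data` to `filepath` via a temp file in the same directory."""
    dest = Path(filepath)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Creating the temp file next to the destination keeps both on one
    # filesystem, so the final rename does not cross a device boundary.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites an existing destination in a single step.
        os.replace(tmp_name, dest)
    except BaseException:
        # Remove the partial temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With the code from the diff above, this would be called with `bytesbuffer.getvalue()` after `torch.save(checkpoint, bytesbuffer)`. I have not tested it against remote filesystems, where a different approach would be needed.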
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response