_atomic_save with transaction causes "Invalid cross-device link" error #20270

@RichardChe

Bug description

I upgraded PyTorch Lightning (pl) from 2.0.4 to 2.4. The same training code now fails with the error below:

2024-09-06 23:01:53 [on_validation_epoch_end] collecting matched instance from 64 ranks
2024-09-06 23:02:17 [on_validation_epoch_end] end
Traceback (most recent call last):
  File "/usr/lib/python3.10/shutil.py", line 816, in move
    os.rename(src, real_dst)
OSError: [Errno 18] Invalid cross-device link: '/tmp/tmp7ue72_bg' -> '/xxx/model/ckpt-epoch=000.ckpt'
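
Errno 18 (EXDEV) is raised when `os.rename` is asked to move a file across filesystem boundaries. A minimal diagnostic sketch, assuming the temporary file lives in the system temp directory as the traceback suggests: `same_device` is a hypothetical helper, and `checkpoint_dir` is a placeholder for the real checkpoint directory. Comparing `st_dev` is a heuristic only, since some mount setups can still refuse a rename even when the device IDs match.

```python
import os
import tempfile


def same_device(path_a: str, path_b: str) -> bool:
    """Hypothetical helper: True if both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Placeholder: replace with the directory the checkpoints are written to.
checkpoint_dir = "."

# If this prints False, a rename from the temp directory into checkpoint_dir
# crosses a filesystem boundary and can fail with EXDEV.
print(same_device(tempfile.gettempdir(), checkpoint_dir))
```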

Comparing the corresponding code between the two versions, I found the following change:

@@ -65,8 +78,69 @@ def _atomic_save(checkpoint: Dict[str, Any], filepath: Union[str, Path]) -> None
             accepts.
         filepath: The path to which the checkpoint will be saved.
             This points to the file that the checkpoint will be stored in.
+
     """
     bytesbuffer = io.BytesIO()
+    log.debug(f"Saving checkpoint: {filepath}")
     torch.save(checkpoint, bytesbuffer)
-    with fsspec.open(filepath, "wb") as f:
+
+    # We use a transaction here to avoid file corruption if the save gets interrupted
+    fs, urlpath = fsspec.core.url_to_fs(str(filepath))
+    with fs.transaction, fs.open(urlpath, "wb") as f:
         f.write(bytesbuffer.getvalue())

It seems that using the transaction causes this error when the file has to be moved between different filesystems (here, from /tmp to the checkpoint directory). Is there a solution?
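
One possible workaround, as a minimal sketch rather than Lightning's or fsspec's actual fix: stage the temporary file in the same directory as the destination so the final rename never crosses a filesystem boundary. `atomic_write_bytes` is a hypothetical helper name, and the sketch assumes a local destination path (not a remote fsspec URL).

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def atomic_write_bytes(data: bytes, filepath: Union[str, Path]) -> None:
    """Hypothetical helper: write data to filepath via a same-directory temp file."""
    target = Path(filepath)
    # Create the temp file next to the target so both are on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites an existing target; within one filesystem the
        # rename is atomic on POSIX systems.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial temp file if anything went wrong, then re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this helper, the buffer from the diff above could be written with `atomic_write_bytes(bytesbuffer.getvalue(), filepath)`. An interrupted save then leaves at most a stray `.tmp` file beside the checkpoint, never a truncated checkpoint.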

What version are you seeing the problem on?

v2.4

Labels: bug, needs triage, ver: 2.4.x