Open
Labels
bug, checkpointing, fabric, strategy: fsdp, ver: 2.5.x
Description
Bug description
Other ParallelStrategy implementations support saving and loading checkpoints from an S3 URL through fsspec. Fabric’s FSDP strategy also uses torch.load and torch.save, which support S3 URLs as well. However, the FSDP wrapper in Fabric converts the input path to a pathlib.Path object. pathlib treats the URL as a filesystem path and collapses the repeated slash, so a valid URL like s3://bucket/xxx becomes the invalid s3:/bucket/xxx, which breaks the checkpoint path.
To resolve this issue, path manipulation should be handled with os.path instead, as is done in TorchCheckpointIO.
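The slash collapse can be shown with the standard library alone. The snippet below is a minimal sketch, not Fabric's code: it uses `PurePosixPath` so the result does not depend on the host OS, and the URL is the example from the description above.

```python
import os.path
from pathlib import PurePosixPath

url = "s3://bucket/xxx"

# pathlib parses the URL as a filesystem path and collapses the
# repeated slash after the scheme.
as_path = str(PurePosixPath(url))

# os.path works on the plain string and keeps the "//" separator.
parent = os.path.dirname(url)

print(as_path)  # s3:/bucket/xxx
print(parent)   # s3://bucket
```

Because `os.path.dirname` only splits on the last separator, the scheme prefix survives and the result can still be handed to an fsspec filesystem.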
What version are you seeing the problem on?
v2.5
How to reproduce the bug
- Use FSDP as the strategy for Fabric.
- Save a checkpoint with `fabric.save("s3://xxx")`
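The two steps above might look roughly like the following script. This is a hedged sketch, not the reporter's code: the `reproduce` helper, the tiny `Linear` model, `accelerator="cuda"` and `devices=2` are assumptions for illustration, and `s3://xxx` is the placeholder URL from the report. Imports are done inside the function so the definition itself does not require `torch` or `lightning`.

```python
def reproduce(checkpoint_url: str = "s3://xxx") -> None:
    """Hypothetical helper: save an FSDP checkpoint to an S3 URL with Fabric."""
    # Imported lazily so this module can be loaded without torch/lightning.
    import torch
    from lightning.fabric import Fabric

    # Assumption: a machine with at least two CUDA devices.
    fabric = Fabric(accelerator="cuda", devices=2, strategy="fsdp")
    fabric.launch()

    # Assumption: any small module is enough to exercise the save path.
    model = fabric.setup(torch.nn.Linear(8, 8))

    # The FSDP strategy converts this URL to a pathlib.Path internally,
    # which is where the "s3://" prefix is reported to be mangled.
    fabric.save(checkpoint_url, {"model": model})
```

Calling `reproduce()` on a multi-GPU machine with S3 credentials configured should exercise the code path described in the bug description.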
Error messages and logs
# Error messages and logs here please
Environment
No response
More info
No response