fabric FSDP strategy save/load checkpoint does not support s3 url #20749

@likesum

Description

Bug description

Other ParallelStrategy implementations support saving checkpoints to, and loading them from, an S3 URL through fsspec. Fabric’s FSDP strategy also uses torch.load and torch.save, which support S3 URLs as well. However, the FSDP wrapper in Fabric first converts the input path to a pathlib.Path object. pathlib treats the repeated slash as a redundant separator and collapses it, so a valid URL like s3://bucket/xxx becomes the invalid s3:/bucket/xxx, which breaks the path.
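The conversion can be seen in isolation with a minimal sketch that does not involve Fabric at all. It uses `PurePosixPath` as a stand-in for `pathlib.Path` on Linux/macOS (an assumption made here so that the result does not depend on the host OS); the URL is the placeholder from this report.

```python
from pathlib import PurePosixPath

# Placeholder URL taken from the report above.
url = "s3://bucket/xxx"

# PurePosixPath applies the same normalization pathlib.Path uses on POSIX
# systems, without touching the filesystem. The "//" after "s3:" is treated
# as a redundant separator and collapsed into a single "/".
mangled = str(PurePosixPath(url))

print(mangled)  # s3:/bucket/xxx
```

Once the string has gone through pathlib, the scheme separator is gone, so anything that later tries to interpret it as a URL no longer sees `s3://`.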

To resolve this issue, path manipulation should be done on the plain string with os.path instead, as is done in TorchCheckpointIO.
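A rough sketch of the string-based approach, assuming POSIX path semantics: `posixpath` is the module that `os.path` resolves to on Linux/macOS and is used here only so the example behaves the same on any OS. The helper `child_path` and the file name `meta.pt` are hypothetical and for illustration only; this is not the actual change in Fabric.

```python
import posixpath


def child_path(checkpoint_dir: str, name: str) -> str:
    """Hypothetical helper: append a file name to a checkpoint location.

    The location stays a plain string, so a URL scheme such as "s3://"
    is never normalized away.
    """
    return posixpath.join(checkpoint_dir, name)


ckpt_dir = "s3://bucket/xxx"              # placeholder URL from the report
meta_file = child_path(ckpt_dir, "meta.pt")  # illustrative file name

print(meta_file)                     # s3://bucket/xxx/meta.pt
print(posixpath.dirname(meta_file))  # s3://bucket/xxx
```

Both joining and taking the parent directory preserve the `s3://` prefix, so the resulting strings can still be handed to an fsspec-aware save or load routine.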

What version are you seeing the problem on?

v2.5

How to reproduce the bug

- Use FSDP as the strategy for fabric.
- Save a checkpoint with `fabric.save("s3://xxx")`

Error messages and logs

# Error messages and logs here please

Environment

No response

More info

No response

cc @lantiga @justusschock

Metadata

Assignees

No one assigned

Labels

- bug: Something isn't working
- checkpointing: Related to checkpointing
- fabric: lightning.fabric.Fabric
- strategy: fsdp: Fully Sharded Data Parallel
- ver: 2.5.x
