
FSDP full state dict mangles fsspec path #20406

@oceanusxiv


Bug description

In FSDPStrategy.save_checkpoint, the filepath variable is transformed via

path = Path(self.broadcast(filepath))

This only makes sense when doing sharded checkpointing, and it mangles any legitimate fsspec path that is passed in.
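For illustration (the path below is just an example), wrapping an fsspec URL in pathlib.Path collapses the double slash after the scheme, at least on POSIX systems, so the result is no longer a valid remote path:

from pathlib import Path

filepath = "s3://example/path/last.ckpt"
print(Path(filepath))
# prints "s3:/example/path/last.ckpt" -- the "//" after the scheme is collapsed,
# so fsspec can no longer recognize it as an s3 URL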

When self._state_dict_type == "full",

super().save_checkpoint(checkpoint=checkpoint, filepath=path)

is called, using the normal CheckpointIO workflow, but with the mangled path.

The expected behavior is that, when the user chooses the full state dict type, CheckpointIO and remote paths work as usual. Currently, full state dict checkpoints cannot be saved to remote paths.
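As a rough sketch of one possible fix (the class and helper names below are simplified stand-ins, not the actual Lightning implementation), the Path conversion could be limited to the sharded branch so the broadcast fsspec URL reaches CheckpointIO untouched for full state dicts:

from pathlib import Path

class _BaseStrategy:
    # Stand-in for the parent strategy; in Lightning the CheckpointIO plugin
    # handles remote/fsspec paths from this point on.
    def save_checkpoint(self, checkpoint, filepath, storage_options=None):
        print(f"CheckpointIO receives: {filepath!r}")

class _PatchedFSDPStrategy(_BaseStrategy):
    # Illustrative only: skip the Path() conversion when saving a full state dict.
    _state_dict_type = "full"

    def broadcast(self, obj):
        # The real method broadcasts from rank 0; a no-op is enough for this sketch.
        return obj

    def save_checkpoint(self, checkpoint, filepath, storage_options=None):
        if self._state_dict_type == "full":
            # Keep the fsspec URL intact and defer to the normal CheckpointIO flow.
            return super().save_checkpoint(checkpoint, self.broadcast(filepath), storage_options)
        # Sharded checkpoints are written to a local directory, so Path() is appropriate here.
        path = Path(self.broadcast(filepath))
        ...

_PatchedFSDPStrategy().save_checkpoint({}, "s3://example/path/last.ckpt")
# CheckpointIO receives: 's3://example/path/last.ckpt'

With the current code, the same call would hand CheckpointIO the mangled 's3:/example/path/last.ckpt' instead.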

What version are you seeing the problem on?

v2.4

How to reproduce the bug

import lightning as L

trainer = L.Trainer(
    strategy="fsdp",
    default_root_dir="s3://example/path",
)

trainer.fit(model=...)

Error messages and logs


Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response
