Bug description
In `FSDPStrategy.save_checkpoint`, the `filepath` argument is transformed via `path = Path(self.broadcast(filepath))`.
This transformation only makes sense for sharded checkpointing, and it mangles any legitimate fsspec path that is passed in.
When `self._state_dict_type == "full"`, `super().save_checkpoint(checkpoint=checkpoint, filepath=path)` is called, which goes through the normal CheckpointIO workflow, but with the mangled path.
The expected behavior is that when the user chooses the full state dict type, CheckpointIO and remote paths work as usual; currently, full state dict checkpoints cannot be saved to remote paths.
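To make the mangling concrete, here is a minimal sketch (the s3 path below is only an example, not from a real run): wrapping an fsspec URL in pathlib.Path collapses the double slash after the scheme, so the result is no longer recognized as a remote URL by fsspec-based CheckpointIO.

from pathlib import Path

# Example remote checkpoint destination (hypothetical bucket/path)
filepath = "s3://example/path/checkpoints/last.ckpt"

# Equivalent of the Path(...) wrapping done in FSDPStrategy.save_checkpoint
path = Path(filepath)

print(str(path))
# -> "s3:/example/path/checkpoints/last.ckpt"  (the "//" is collapsed)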
What version are you seeing the problem on?
v2.4
How to reproduce the bug
import lightning as L

trainer = L.Trainer(
    strategy="fsdp",
    default_root_dir="s3://example/path",
)
trainer.fit(model=...)
Error messages and logs
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response