Description
I'm training a model with the PyTorch Lightning plug-in, with a limit on the number of checkpoints kept:
ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)
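To make the retention behaviour concrete, here is a simplified model of what this configuration asks for: with monitor="global_step" and mode="max", the newest checkpoints are kept and the oldest one falls out once the limit is exceeded. This is only a sketch and not Lightning's implementation; the function name `checkpoints_to_evict` and the dictionary layout are hypothetical.

```python
import heapq


def checkpoints_to_evict(saved, k):
    """Return the checkpoint paths that fall outside the top-k.

    `saved` maps a checkpoint path to its monitored value (here: global_step).
    Hypothetical helper that mimics mode="max"; not part of Lightning.
    """
    if k < 0:
        # save_top_k=-1 means "keep every checkpoint".
        return []
    # Keep the k entries with the largest monitored value.
    keep = set(heapq.nlargest(k, saved, key=saved.get))
    return sorted(path for path in saved if path not in keep)


saved = {"step=100": 100, "step=200": 200, "step=300": 300}
evicted = checkpoints_to_evict(saved, k=2)
```

With k=2, only the checkpoint with the lowest global_step is selected for removal, which is the one whose deletion triggers the crash described below.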
The problem is that once the limit defined in save_top_k is reached, PTL eventually calls lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:
fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")
However, when a process then tries to remove a checkpoint file that has already been removed (I'm using xser), it crashes:
File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
self._run(model, ckpt_path=ckpt_path)
As the traceback shows, more than one process is trying to remove the same file. I think the fix is simply to run checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
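A minimal sketch of the idea, using only the standard library rather than the actual Lightning or neuronx-distributed code: the function name `remove_checkpoint_rank_zero` is hypothetical, and reading the rank from a `RANK` environment variable is an assumption about the launcher (torchrun exports it; other launchers may not).

```python
import os
import shutil


def remove_checkpoint_rank_zero(path, rank=None):
    """Remove a checkpoint directory from a single process only.

    Returns True if this process performed the removal, False if it skipped.
    Hypothetical helper; not the library's API.
    """
    if rank is None:
        # Assumption: the launcher exports the global rank as RANK.
        rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        # All other ranks leave the directory alone, avoiding the race.
        return False
    # ignore_errors=True also tolerates files that have already disappeared.
    shutil.rmtree(path, ignore_errors=True)
    return True
```

In a real run, the other ranks would probably also need a barrier before they write the next checkpoint, and this assumes the checkpoint directory is on storage visible to global rank 0.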
Here is relevant info about my environment:
pip freeze:
neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0
Neuron libraries:
aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]