
Clean up of old checkpoints is crashing #20

@evellasques


I'm training a model using the PyTorch Lightning plug-in, with a limit on the number of checkpoints kept:

ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)

The problem is that once the limit defined by save_top_k is reached, PTL eventually calls lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:

fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")

However, when a process then tries to remove a checkpoint file that has already been removed (I'm using xser), it crashes:

 File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)

As you can see, more than one process is trying to remove the same file. I think the fix would simply be to run checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
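To illustrate the rank-0 idea, here is a minimal, standard-library-only sketch. It is not the Lightning or neuronx-distributed implementation: the helper name remove_checkpoint_on_rank_zero and its global_rank parameter are hypothetical, and it assumes every rank sees the same shared checkpoint directory.

```python
import os
import shutil


def remove_checkpoint_on_rank_zero(path: str, global_rank: int) -> bool:
    """Delete a checkpoint file or directory, but only from global rank 0.

    Hypothetical helper (not part of Lightning): all other ranks return
    immediately, so no two processes walk the same directory tree at once.
    Returns True if this call removed something, False otherwise.
    """
    if global_rank != 0:
        # Non-zero ranks never touch the filesystem.
        return False
    if os.path.isdir(path):
        # Sharded checkpoints (for example one tensor_*.pt per tensor)
        # live in a directory, so remove the whole tree.
        shutil.rmtree(path)
        return True
    if os.path.exists(path):
        # Single-file checkpoint.
        os.remove(path)
        return True
    # Nothing to remove (already cleaned up).
    return False
```

In a real run the rank would have to come from the launcher or the trainer (that wiring is an assumption here), and the other ranks may need a barrier after the removal if they depend on the directory being gone. In Lightning the same guard could presumably go into a custom checkpoint I/O plug-in's remove_checkpoint(), but I have not verified that against the neuronx-distributed plug-in.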

Here is relevant info about my environment:

pip freeze:

neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0

Neuron libraries:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]

Labels: Training, bug
