"FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" when using strategy='deepspeed_stage_2' #20453

@ShiweiWu98

Bug description

I'm encountering the error "FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" while trying to resume training using PyTorch Lightning with strategy='deepspeed_stage_2'. My training script saves only a .ckpt file, but DeepSpeed seems to require additional files for restoring checkpoints.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

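import os

import pytorch_lightning as PL
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger

# setup() and SSLFLArch come from the project code and are not shown in this report.


def do_train(args):
    """Train one stage."""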
    cfg = setup(args)
    cfg.exp_name = os.path.basename(args.output_dir)
    if args.use_wandb:
        logger = WandbLogger(project='demo',
                             offline=True,
                             name=cfg.exp_name,
                             resume="allow" if args.resume_ckpt else None,
                             entity="team")
    else:
        logger = TensorBoardLogger("./tb_logs", name="demo", version=cfg.exp_name)
    model = SSLFLArch(cfg)
    ckpt_cb = ModelCheckpoint(dirpath=f'./ckpts/{cfg.exp_name}',
                              filename='{epoch:d}',
                              every_n_epochs=10,
                              save_top_k=-1)
    callbacks = [ckpt_cb]
    trainer = PL.Trainer(max_epochs=cfg.optim["epochs"],
                         callbacks=callbacks,
                         logger=logger,
                         enable_model_summary=False,
                         precision=16 if cfg.compute_precision.grad_scaler else 32,
                         log_every_n_steps=10,
                         accelerator='gpu',
                         devices=[0, 1, 2, 3],
                         strategy='deepspeed_stage_2',
                         )
    trainer.fit(model,
                ckpt_path=args.resume_ckpt,
                )

Error messages and logs

"FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint"
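
To narrow this down, a small check like the one below could show what actually sits at the path passed as ckpt_path. It is only a diagnostic sketch: `describe_checkpoint` is a hypothetical helper (not part of Lightning or DeepSpeed), and it assumes, without confirmation from this report, that the relevant difference is whether the path is a single file or a directory. It uses only the standard library and does not try to validate the checkpoint contents.

```python
from pathlib import Path


def describe_checkpoint(path):
    """Report whether `path` is missing, a single file, or a directory.

    Hypothetical helper for debugging only. For a directory, the names of
    its top-level entries are returned in sorted order.
    """
    p = Path(path)
    if not p.exists():
        return {"kind": "missing", "entries": []}
    if p.is_file():
        return {"kind": "file", "entries": []}
    # Directory: list what is directly inside it, without recursing.
    entries = sorted(child.name for child in p.iterdir())
    return {"kind": "directory", "entries": entries}


if __name__ == "__main__":
    import sys

    # Usage: python describe_ckpt.py <path> [<path> ...]
    for arg in sys.argv[1:]:
        print(arg, describe_checkpoint(arg))
```

Running it on the value given to args.resume_ckpt would make it clear whether a file or a directory is being handed to trainer.fit.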

Environment

  • CUDA:
    - GPU:
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - available: True
    - version: 12.1
  • Lightning:
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.3.0
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
  • Packages:
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - asttokens: 2.4.1
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - brotli: 1.0.9
    - cachetools: 5.5.0
    - certifi: 2024.7.4
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - comm: 0.2.2
    - debugpy: 1.6.7
    - decorator: 5.1.1
    - deepspeed: 0.15.0
    - docker-pycreds: 0.4.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - filelock: 3.13.1
    - fsspec: 2024.3.1
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - gmpy2: 2.1.2
    - hjson: 3.1.0
    - idna: 3.7
    - importlib-metadata: 8.4.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - ipykernel: 6.29.5
    - ipython: 8.27.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jedi: 0.19.1
    - jinja2: 3.1.4
    - jupyter-client: 8.6.2
    - jupyter-core: 5.7.2
    - lightning-utilities: 0.9.0
    - markupsafe: 2.1.3
    - matplotlib-inline: 0.1.7
    - mkl-fft: 1.3.8
    - mkl-random: 1.2.4
    - mkl-service: 2.4.0
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - nest-asyncio: 1.6.0
    - networkx: 3.3
    - ninja: 1.11.1.1
    - numpy: 1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-ml-py: 12.535.161
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.6.20
    - nvidia-nvtx-cu12: 12.1.105
    - nvitop: 1.3.2
    - omegaconf: 2.3.0
    - ordered-set: 4.1.0
    - packaging: 24.1
    - pandas: 2.2.2
    - parso: 0.8.4
    - pexpect: 4.9.0
    - pickleshare: 0.7.5
    - pillow: 10.4.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - prompt-toolkit: 3.0.47
    - protobuf: 5.27.3
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - py-cpuinfo: 9.0.0
    - pydantic: 2.8.2
    - pydantic-core: 2.20.1
    - pygments: 2.18.0
    - pysocks: 1.7.1
    - python-dateutil: 2.9.0
    - pytorch-lightning: 2.3.0
    - pytz: 2024.2
    - pyyaml: 6.0.1
    - pyzmq: 25.1.2
    - requests: 2.32.3
    - sentry-sdk: 2.13.0
    - setproctitle: 1.3.3
    - setuptools: 72.1.0
    - six: 1.16.0
    - smmap: 5.0.1
    - stack-data: 0.6.2
    - sympy: 1.12
    - termcolor: 2.4.0
    - tomli: 2.0.1
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
    - tornado: 6.4.1
    - tqdm: 4.66.4
    - traitlets: 5.14.3
    - triton: 3.0.0
    - typeguard: 4.3.0
    - typing-extensions: 4.11.0
    - tzdata: 2024.1
    - urllib3: 2.2.2
    - wandb: 0.17.7
    - wcwidth: 0.2.13
    - wheel: 0.43.0
    - xformers: 0.0.27.post2
    - zipp: 3.20.1
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.11.9
    - release: 5.15.0-87-generic
    - version: #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023

More info

No response

cc @lantiga
