Open
Labels
bug (Something isn't working), checkpointing (Related to checkpointing), strategy: deepspeed, ver: 2.4.x
Description
Bug description
I'm encountering the error "FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" while trying to resume training using PyTorch Lightning with strategy='deepspeed_stage_2'. My training script saves only a .ckpt file, but DeepSpeed seems to require additional files for restoring checkpoints.
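To narrow this down, here is a minimal sketch for checking what is actually on disk at the path passed as `ckpt_path`. It assumes that a DeepSpeed checkpoint is expected to be a directory holding several files rather than a single file; the helper names `describe_checkpoint_path` and `list_checkpoint_contents` are hypothetical and not part of Lightning or DeepSpeed.

```python
import os


def describe_checkpoint_path(path):
    """Classify a checkpoint path as 'missing', 'file', or 'directory'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        return "directory"
    return "file"


def list_checkpoint_contents(path):
    """Return the sorted relative paths of all files under a directory."""
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            found.append(os.path.relpath(full, path))
    return sorted(found)
```

If `describe_checkpoint_path(args.resume_ckpt)` returns `"file"`, the path points at a single file, which matches the situation described above. If it returns `"directory"`, `list_checkpoint_contents` shows which files were saved inside it.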
What version are you seeing the problem on?
v2.4
How to reproduce the bug
import os

import pytorch_lightning as PL
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger

# `setup` and `SSLFLArch` are defined elsewhere in my project.


def do_train(args):
    """Train one stage."""
    cfg = setup(args)
    cfg.exp_name = os.path.basename(args.output_dir)
    if args.use_wandb:
        logger = WandbLogger(project='demo',
                             offline=True,
                             name=cfg.exp_name,
                             resume="allow" if args.resume_ckpt else None,
                             entity="team")
    else:
        logger = TensorBoardLogger("./tb_logs", name="demo", version=cfg.exp_name)
    model = SSLFLArch(cfg)
    # Save a checkpoint every 10 epochs and keep all of them.
    ckpt_cb = ModelCheckpoint(dirpath=f'./ckpts/{cfg.exp_name}',
                              filename='{epoch:d}',
                              every_n_epochs=10,
                              save_top_k=-1)
    callbacks = [ckpt_cb]
    trainer = PL.Trainer(max_epochs=cfg.optim["epochs"],
                         callbacks=callbacks,
                         logger=logger,
                         enable_model_summary=False,
                         precision=16 if cfg.compute_precision.grad_scaler else 32,
                         log_every_n_steps=10,
                         accelerator='gpu',
                         devices=[0, 1, 2, 3],
                         strategy='deepspeed_stage_2',
                         )
    # Resuming from args.resume_ckpt is what raises the error.
    trainer.fit(model,
                ckpt_path=args.resume_ckpt,
                )
Error messages and logs
"FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint"
Environment
- CUDA:
- GPU:
- NVIDIA A100 80GB PCIe
- NVIDIA A100 80GB PCIe
- NVIDIA A100 80GB PCIe
- NVIDIA A100 80GB PCIe
- available: True
- version: 12.1
- Lightning:
- lightning-utilities: 0.9.0
- pytorch-lightning: 2.3.0
- torch: 2.4.0
- torchaudio: 2.3.0
- torchmetrics: 1.4.0.post0
- torchvision: 0.19.0
- Packages:
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- asttokens: 2.4.1
- autocommand: 2.2.2
- backports.tarfile: 1.2.0
- brotli: 1.0.9
- cachetools: 5.5.0
- certifi: 2024.7.4
- charset-normalizer: 3.3.2
- click: 8.1.7
- comm: 0.2.2
- debugpy: 1.6.7
- decorator: 5.1.1
- deepspeed: 0.15.0
- docker-pycreds: 0.4.0
- exceptiongroup: 1.2.2
- executing: 2.1.0
- filelock: 3.13.1
- fsspec: 2024.3.1
- gitdb: 4.0.11
- gitpython: 3.1.43
- gmpy2: 2.1.2
- hjson: 3.1.0
- idna: 3.7
- importlib-metadata: 8.4.0
- importlib-resources: 6.4.0
- inflect: 7.3.1
- ipykernel: 6.29.5
- ipython: 8.27.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jedi: 0.19.1
- jinja2: 3.1.4
- jupyter-client: 8.6.2
- jupyter-core: 5.7.2
- lightning-utilities: 0.9.0
- markupsafe: 2.1.3
- matplotlib-inline: 0.1.7
- mkl-fft: 1.3.8
- mkl-random: 1.2.4
- mkl-service: 2.4.0
- more-itertools: 10.3.0
- mpmath: 1.3.0
- nest-asyncio: 1.6.0
- networkx: 3.3
- ninja: 1.11.1.1
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-ml-py: 12.535.161
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.20
- nvidia-nvtx-cu12: 12.1.105
- nvitop: 1.3.2
- omegaconf: 2.3.0
- ordered-set: 4.1.0
- packaging: 24.1
- pandas: 2.2.2
- parso: 0.8.4
- pexpect: 4.9.0
- pickleshare: 0.7.5
- pillow: 10.4.0
- pip: 24.2
- platformdirs: 4.2.2
- prompt-toolkit: 3.0.47
- protobuf: 5.27.3
- psutil: 6.0.0
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- py-cpuinfo: 9.0.0
- pydantic: 2.8.2
- pydantic-core: 2.20.1
- pygments: 2.18.0
- pysocks: 1.7.1
- python-dateutil: 2.9.0
- pytorch-lightning: 2.3.0
- pytz: 2024.2
- pyyaml: 6.0.1
- pyzmq: 25.1.2
- requests: 2.32.3
- sentry-sdk: 2.13.0
- setproctitle: 1.3.3
- setuptools: 72.1.0
- six: 1.16.0
- smmap: 5.0.1
- stack-data: 0.6.2
- sympy: 1.12
- termcolor: 2.4.0
- tomli: 2.0.1
- torch: 2.4.0
- torchaudio: 2.3.0
- torchmetrics: 1.4.0.post0
- torchvision: 0.19.0
- tornado: 6.4.1
- tqdm: 4.66.4
- traitlets: 5.14.3
- triton: 3.0.0
- typeguard: 4.3.0
- typing-extensions: 4.11.0
- tzdata: 2024.1
- urllib3: 2.2.2
- wandb: 0.17.7
- wcwidth: 0.2.13
- wheel: 0.43.0
- xformers: 0.0.27.post2
- zipp: 3.20.1
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.11.9
- release: 5.15.0-87-generic
- version: #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023
More info
No response
cc @lantiga
zhanwenchen