When interrupting a run with Ctrl+C, sometimes the WandbLogger does not upload a checkpoint artifact #20425

@edmcman

Bug description

When a run is interrupted with Ctrl+C, the WandbLogger sometimes does not upload a checkpoint artifact.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

Epoch 20:  28%|██▊     | 6502/23178 [29:11<1:14:53,  3.71it/s, v_num=gwj7, train_loss=nan.0]^C
Detected KeyboardInterrupt, attempting graceful shutdown ...
wandb: 🚀 View run train-release-0.1 at: https://wandb.ai/eschwartz/dire/runs/uvexgwj7
Epoch 20:  28%|██▊     | 6502/23178 [29:16<1:15:04,  3.70it/s, v_num=gwj7, train_loss=nan.0]

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 4070 Laptop GPU
    • available: True
    • version: 12.1
  • Lightning:
    • lightning-utilities: 0.11.7
    • pytorch-lightning: 2.4.0
    • torch: 2.3.0
    • torchmetrics: 1.6.0
  • Packages:
    • absl-py: 2.1.0
    • aiohappyeyeballs: 2.4.3
    • aiohttp: 3.10.10
    • aiosignal: 1.3.1
    • appdirs: 1.4.4
    • asttokens: 2.4.1
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • braceexpand: 0.1.7
    • certifi: 2024.2.2
    • charset-normalizer: 3.3.2
    • click: 8.1.7
    • decorator: 5.1.1
    • docker-pycreds: 0.4.0
    • docopt: 0.6.2
    • editdistance: 0.5.3
    • et-xmlfile: 1.1.0
    • exceptiongroup: 1.2.2
    • executing: 2.1.0
    • filelock: 3.13.4
    • frozenlist: 1.5.0
    • fsspec: 2024.3.1
    • future: 1.0.0
    • gitdb: 4.0.11
    • gitpython: 3.1.43
    • grpcio: 1.62.2
    • hjson: 3.1.0
    • idna: 3.7
    • ipdb: 0.13.13
    • ipython: 8.27.0
    • jedi: 0.19.1
    • jep: 4.2.0
    • jinja2: 3.1.3
    • jsonlines: 4.0.0
    • jsonnet: 0.16.0
    • lightning-utilities: 0.11.7
    • markdown: 3.6
    • markdown-it-py: 2.2.0
    • markupsafe: 2.1.5
    • matplotlib-inline: 0.1.7
    • mdurl: 0.1.2
    • mpmath: 1.3.0
    • msgpack: 1.0.8
    • multidict: 6.1.0
    • networkx: 3.3
    • numpy: 1.26.4
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.4.127
    • nvidia-nvtx-cu12: 12.1.105
    • objectio: 0.2.29
    • openpyxl: 3.1.2
    • packaging: 24.1
    • pandas: 2.2.2
    • parso: 0.8.4
    • pexpect: 4.9.0
    • pillow: 10.3.0
    • pip: 22.0.2
    • platformdirs: 4.3.6
    • prompt-toolkit: 3.0.47
    • propcache: 0.2.0
    • protobuf: 4.25.3
    • psutil: 5.9.8
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • pyelftools: 0.31
    • pygments: 2.6.1
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.4.0
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • requests: 2.31.0
    • rich: 13.2.0
    • sentencepiece: 0.1.99
    • sentry-sdk: 2.0.1
    • setproctitle: 1.3.3
    • setuptools: 59.6.0
    • shellingham: 1.5.4
    • simplejson: 3.19.2
    • six: 1.16.0
    • smmap: 5.0.1
    • stack-data: 0.6.3
    • sympy: 1.12
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • tomli: 2.0.1
    • torch: 2.3.0
    • torchmetrics: 1.6.0
    • tqdm: 4.66.2
    • traitlets: 5.14.3
    • triton: 2.3.0
    • typer: 0.12.3
    • typing-extensions: 4.11.0
    • tzdata: 2024.1
    • ujson: 3.2.0
    • urllib3: 2.2.1
    • wandb: 0.18.6
    • wcwidth: 0.2.13
    • webdataset: 0.2.100
    • werkzeug: 3.0.2
    • yarl: 1.16.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.12
    • release: 6.8.0-48-generic
    • version: #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2

More info

No response

cc @lantiga @morganmcg1 @borisdayma @scottire @parambharat
