BUG: Without internet, neptune gets stuck in multi-gpu pytorch lightning and crashes the run #20747

@simon-ging

Description

Bug description

Describe the bug

If the Neptune client loses its connection to the Neptune server for any reason, it blocks indefinitely on GPU 0 while waiting for the connection to come back. The other GPUs then time out waiting for GPU 0 at the end of the training epoch, which crashes the training run.

Reproduction

Train anything with PyTorch Lightning in a multi-GPU setting.

Use Neptune as the logger.

Have an internet connection that stalls indefinitely (e.g. blocked ports).

At the end of the training epoch, wait another 30 minutes until you get an NCCL timeout: GPUs 1-7 give up waiting for GPU 0 and crash.
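
The failure mode in these steps can be illustrated without GPUs or Neptune. The sketch below is a plain-Python simulation, not Lightning or NCCL code: threads stand in for ranks, `threading.Barrier` stands in for the end-of-epoch synchronization, and `time.sleep` stands in for the logger call that stalls on rank 0. All names (`simulate_epoch_end`, `barrier_timeout`, `logger_stall`) are hypothetical and chosen only for illustration.

```python
import threading
import time


def simulate_epoch_end(world_size=4, barrier_timeout=0.5, logger_stall=2.0):
    """Simulate ranks meeting at an end-of-epoch barrier while rank 0 is stuck.

    Returns a dict mapping rank -> "ok" or "timeout".
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def rank_fn(rank):
        if rank == 0:
            # Stand-in for a logger call that blocks on a dead connection.
            time.sleep(logger_stall)
        try:
            # Stand-in for the collective that every rank must reach.
            barrier.wait(timeout=barrier_timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # One rank timing out breaks the barrier for all ranks,
            # including rank 0 when it finally arrives.
            results[rank] = "timeout"

    threads = [
        threading.Thread(target=rank_fn, args=(rank,))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(simulate_epoch_end())
```

When `logger_stall` is longer than `barrier_timeout`, every rank ends in `"timeout"`, which mirrors the whole job going down because one rank was blocked in logging. With `logger_stall=0`, all ranks pass the barrier.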

Expected behavior

The client should degrade gracefully and store the data locally instead of tearing down the whole run.
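
One possible shape for this behavior, as a minimal sketch only: bound each remote logging call with a timeout and, if it does not finish in time or raises, append the record to a local file for later upload. This is not how the Neptune client or Lightning's logger is implemented; `log_with_fallback`, `send`, and `fallback_path` are hypothetical names, and `send` stands in for whatever call actually ships a record to the server.

```python
import json
import queue
import threading
from pathlib import Path


def log_with_fallback(send, record, fallback_path, timeout=5.0):
    """Try send(record) with a time limit; on failure, store the record locally.

    Returns True if the record was sent, False if it was written to
    fallback_path (one JSON object per line).
    """
    outcome = queue.Queue(maxsize=1)

    def worker():
        try:
            send(record)
            outcome.put(None)  # None signals success
        except Exception as exc:  # report any send error to the caller
            outcome.put(exc)

    # Daemon thread: a send that never returns must not block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()

    try:
        error = outcome.get(timeout=timeout)
    except queue.Empty:
        error = TimeoutError("send did not finish in time")

    if error is None:
        return True

    # Fall back to local storage so the training loop can continue.
    with Path(fallback_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return False
```

A caller on rank 0 would then never wait longer than `timeout` per record. One trade-off in this sketch: a `send` that completes after the timeout leads to the record existing both remotely and in the local file, so a later upload step would need to deduplicate.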

Environment

The output of pip list:

Package                   Version        Editable project location
------------------------- -------------- -------------------------------
accelerate                1.4.0
aiohappyeyeballs          2.5.0
aiohttp                   3.11.13
aiosignal                 1.3.2
annotated-types           0.7.0
antlr4-python3-runtime    4.9.3
anyio                     4.8.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
astroid                   3.3.9
asttokens                 3.0.0
async-lru                 2.0.4
attrs                     25.1.0
babel                     2.17.0
beautifulsoup4            4.13.3
bitsandbytes              0.45.3
black                     25.1.0
bleach                    6.2.0
blinker                   1.9.0
blis                      1.2.0
boto3                     1.37.9
botocore                  1.37.9
braceexpand               0.1.7
bravado                   11.1.0
bravado-core              6.1.1
cachetools                5.5.2
cairocffi                 1.7.1
CairoSVG                  2.7.1
catalogue                 2.0.10
certifi                   2025.1.31
cffi                      1.17.1
cfgv                      3.4.0
chardet                   5.2.0
charset-normalizer        3.4.1
click                     8.1.8
cloudpathlib              0.21.0
comm                      0.2.2
confection                0.1.5
contourpy                 1.3.1
coverage                  7.6.12
cssselect2                0.8.0
cycler                    0.12.1
cymem                     2.0.11
dash                      2.18.2
dash-bootstrap-components 1.7.1
dash-core-components      2.0.0
dash-html-components      2.0.0
dash-table                5.0.0
datasets                  3.3.2
debugpy                   1.8.14
decorator                 5.2.1
defusedxml                0.7.1
Deprecated                1.2.18
dill                      0.3.8
distlib                   0.3.9
executing                 2.2.0
fastjsonschema            2.21.1
fasttext-numpy2           0.10.4
filelock                  3.17.0
Flask                     3.0.3
flask-sock                0.7.0
fonttools                 4.56.0
fqdn                      1.5.1
frozenlist                1.5.0
fsspec                    2024.12.0
ftfy                      6.3.1
future                    1.0.0
gitdb                     4.0.12
GitPython                 3.1.44
google-api-core           2.24.1
google-api-python-client  2.163.0
google-auth               2.38.0
google-auth-httplib2      0.2.0
googleapis-common-protos  1.69.1
greenlet                  3.1.1
h11                       0.14.0
h5py                      3.13.0
httpcore                  1.0.7
httplib2                  0.22.0
httpx                     0.28.1
huggingface-hub           0.29.2
identify                  2.6.9
idna                      3.10
importlib_metadata        8.6.1
importlib_resources       6.5.2
iniconfig                 2.0.0
ipdb                      0.13.13
ipykernel                 6.29.5
ipython                   9.0.2
ipython_pygments_lexers   1.1.1
ipywidgets                8.1.5
isoduration               20.11.0
isort                     6.0.1
itables                   2.2.5
itsdangerous              2.2.0
jedi                      0.19.2
Jinja2                    3.1.6
jmespath                  1.0.1
joblib                    1.4.2
json5                     0.10.0
jsonpointer               3.0.0
jsonref                   1.1.0
jsonschema                4.23.0
jsonschema-specifications 2024.10.1
jupyter                   1.1.1
jupyter_client            8.6.3
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.12.0
jupyter-lsp               2.2.5
jupyter_server            2.15.0
jupyter_server_terminals  0.5.3
jupyterlab                4.3.5
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
jupyterlab_widgets        3.0.13
kiwisolver                1.4.8
langcodes                 3.5.0
language_data             1.3.0
lightning                 2.5.0.post0
lightning-utilities       0.14.0
lmdb                      1.6.2
loguru                    0.7.3
lxml                      5.3.1
marisa-trie               1.2.1
markdown-it-py            3.0.0
MarkupSafe                3.0.2
matplotlib                3.10.1
matplotlib-inline         0.1.7
maturin                   1.8.2
mccabe                    0.7.0
mdurl                     0.1.2
memory-tempfile           2.2.3
mistune                   3.1.2
monotonic                 1.6
mpmath                    1.3.0
msgpack                   1.1.0
multidict                 6.1.0
multiprocess              0.70.16
murmurhash                1.0.12
mypy-extensions           1.0.0
narwhals                  1.29.1
natsort                   8.4.0
nbclient                  0.10.2
nbconvert                 7.16.6
nbformat                  5.10.4
neptune                   1.13.0
nest-asyncio              1.6.0
networkx                  3.4.2
nibabel                   5.3.2
nilearn                   0.11.1
nltk                      3.9.1
nodeenv                   1.9.1
notebook                  7.3.2
notebook_shim             0.2.4
numpy                     2.2.3
nvidia-cublas-cu12        12.4.5.8
nvidia-cuda-cupti-cu12    12.4.127
nvidia-cuda-nvrtc-cu12    12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.2.1.3
nvidia-curand-cu12        10.3.5.147
nvidia-cusolver-cu12      11.6.1.9
nvidia-cusparse-cu12      12.3.1.170
nvidia-cusparselt-cu12    0.6.2
nvidia-ml-py              12.570.86
nvidia-ml-py3             7.352.0
nvidia-nccl-cu12          2.21.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.4.127
oauthlib                  3.2.2
omegaconf                 2.3.0
overrides                 7.7.0
packaging                 24.2
pandas                    2.2.3
pandocfilters             1.5.1
parso                     0.8.4
pathspec                  0.12.1
peft                      0.14.0
pexpect                   4.9.0
pillow                    11.1.0
pillow-avif-plugin        1.5.0
pip                       25.0.1
platformdirs              4.3.6
plotly                    6.0.0
pluggy                    1.5.0
pre_commit                4.1.0
preshed                   3.0.9
prometheus_client         0.21.1
prompt_toolkit            3.0.50
propcache                 0.3.0
proto-plus                1.26.0
protobuf                  5.29.3
psutil                    7.0.0
psycopg2-binary           2.9.10
ptyprocess                0.7.0
pure_eval                 0.2.3
pyarrow                   19.0.1
pyasn1                    0.6.1
pyasn1_modules            0.4.1
pybind11                  2.13.6
pycocoevalcap             1.2
pycocotools               2.0.8
pycparser                 2.22
pydantic                  2.10.6
pydantic_core             2.27.2
pydub                     0.25.1
Pygments                  2.19.1
pyinstrument              5.0.1
PyJWT                     2.10.1
pylint                    3.3.5
pyparsing                 3.2.1
pytest                    8.3.5
pytest-cov                6.0.0
python-dateutil           2.9.0.post0
python-dotenv             1.0.1
python-json-logger        3.3.0
pytorch-lightning         2.5.0.post0
PyTurboJPEG               1.7.7
pytz                      2025.1
PyYAML                    6.0.2
pyzmq                     26.2.1
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
requests-oauthlib         2.0.0
retrying                  1.3.4
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.9.4
rpds-py                   0.23.1
rsa                       4.9
s3transfer                0.11.4
safetensors               0.5.3
scikit-learn              1.6.1
scipy                     1.15.2
seaborn                   0.13.2
Send2Trash                1.8.3
sentencepiece             0.2.0
setuptools                75.8.2
shellingham               1.5.4
simple-websocket          1.1.0
simplejson                3.20.1
six                       1.17.0
smart-open                7.1.0
smmap                     5.0.2
sniffio                   1.3.1
soupsieve                 2.6
spacy                     3.8.4
spacy-legacy              3.0.12
spacy-loggers             1.0.5
SQLAlchemy                2.0.38
srsly                     2.5.1
stack-data                0.6.3
swagger-spec-validator    3.0.4
sympy                     1.13.1
tabulate                  0.9.0
termcolor                 2.5.0
terminado                 0.18.1
thinc                     8.3.4
threadpoolctl             3.5.0
timm                      1.0.15
tinycss2                  1.4.0
tokenize_rt               6.1.0
tokenizers                0.21.0
tomlkit                   0.13.2
torch                     2.6.0
torchaudio                2.6.0
torchmetrics              1.6.2
torchvision               0.21.0
tornado                   6.4.2
tqdm                      4.67.1
traitlets                 5.14.3
transformers              4.49.0
treelib                   1.7.1
triton                    3.2.0
typer                     0.15.2
types-python-dateutil     2.9.0.20241206
typing_extensions         4.12.2
tzdata                    2025.1
uri-template              1.3.0
uritemplate               4.1.1
urllib3                   2.3.0
virtualenv                20.29.3
wasabi                    1.1.3
wcwidth                   0.2.13
weasel                    0.4.1
webcolors                 24.11.1
webdataset                0.2.111
webencodings              0.5.1
websocket-client          1.8.0
Werkzeug                  3.0.6
wheel                     0.45.1
widgetsnbextension        4.0.13
wrapt                     1.17.2
wsproto                   1.2.0
xxhash                    3.5.0
yarl                      1.18.3
zipp                      3.21.0
zstandard                 0.23.0

The operating system you're using:

Ubuntu 22.04

The output of python --version:

Python 3.12.9

Additional context

Training on a SLURM cluster.

What version are you seeing the problem on?

v2.5

More info

The neptune-client team asked me to move this issue to pytorch-lightning. Original issue:

neptune-ai/neptune-client#1918
