Description
Bug description
Describe the bug
If the Neptune client loses its connection to the Neptune server for any reason, the process on GPU 0 waits indefinitely for the connection to come back. The training run then crashes at the end of the training epoch.
Reproduction
Train anything with PyTorch Lightning in a multi-GPU setting.
Use Neptune as the logger.
Have an internet connection that stalls forever (e.g. blocked ports).
At the end of the training epoch, wait another 30 min until you get an NCCL timeout: GPUs 1-7 give up waiting for GPU 0 and crash.
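The failure mode in the last step can be illustrated without GPUs. The following is only an analogy built on Python's standard library, not NCCL or Lightning code: `threading.Barrier` stands in for the end-of-epoch collective, and the constants `BARRIER_TIMEOUT_S` and `LOGGER_STALL_S` are made-up stand-ins for the 30 min timeout and the stalled logger call.

```python
import threading
import time

WORLD_SIZE = 8
BARRIER_TIMEOUT_S = 0.2  # hypothetical stand-in for the ~30 min collective timeout
LOGGER_STALL_S = 1.0     # hypothetical stand-in for a logger call that never returns


def run_epoch_end(rank, barrier, results):
    """One simulated rank reaching the end-of-epoch synchronization point."""
    if rank == 0:
        # Rank 0 is stuck in the logger while the other ranks already wait.
        time.sleep(LOGGER_STALL_S)
    try:
        barrier.wait(timeout=BARRIER_TIMEOUT_S)
        results[rank] = "synced"
    except threading.BrokenBarrierError:
        if rank == 0:
            results[rank] = "arrived after the barrier broke"
        else:
            results[rank] = "timed out waiting for rank 0"


def simulate_epoch_end():
    """Run all simulated ranks and return what happened to each one."""
    barrier = threading.Barrier(WORLD_SIZE)
    results = {}
    threads = [
        threading.Thread(target=run_epoch_end, args=(rank, barrier, results))
        for rank in range(WORLD_SIZE)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    for rank, outcome in sorted(simulate_epoch_end().items()):
        print(rank, outcome)
```

In this sketch every rank except 0 fails with a timeout even though nothing is wrong with those ranks themselves, which mirrors the crash pattern described above.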
Expected behavior
The client should degrade gracefully, for example by storing data locally, instead of tearing down the whole training run.
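One possible shape of such behavior, as a minimal sketch only: the class `FallbackMetricLogger`, its `send` callable, and the JSON Lines fallback file are all hypothetical names introduced for illustration and are not part of the Neptune or Lightning APIs. The idea is to give each remote call a deadline and append the record to a local file when the deadline passes.

```python
import json
import threading
from pathlib import Path
from typing import Callable, Dict


class FallbackMetricLogger:
    """Hypothetical wrapper: try a remote send with a deadline, else store locally."""

    def __init__(self, send: Callable[[Dict], None], fallback_path: str,
                 timeout_s: float = 5.0):
        self._send = send                  # assumed: any callable that uploads one record
        self._path = Path(fallback_path)   # local JSON Lines file used as the fallback
        self._timeout_s = timeout_s

    def log(self, record: Dict) -> bool:
        """Return True if the record was sent remotely, False if it was stored locally."""
        outcome = {}

        def _attempt():
            try:
                self._send(record)
                outcome["ok"] = True
            except Exception:
                outcome["ok"] = False

        # A daemon thread cannot keep the process alive if the send stalls forever.
        worker = threading.Thread(target=_attempt, daemon=True)
        worker.start()
        worker.join(self._timeout_s)

        if outcome.get("ok"):
            return True

        # Deadline passed or the send failed: keep the data on local disk instead.
        with self._path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return False
```

The trade-offs of this sketch are worth noting: a send that completes just after the deadline leaves the record both uploaded and stored locally, and each stalled call leaves one blocked daemon thread behind, so a real implementation would likely switch to local-only logging after the first timeout.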
Traceback
Environment
The output of pip list:
Package Version Editable project location
------------------------- -------------- -------------------------------
accelerate 1.4.0
aiohappyeyeballs 2.5.0
aiohttp 3.11.13
aiosignal 1.3.2
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.8.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astroid 3.3.9
asttokens 3.0.0
async-lru 2.0.4
attrs 25.1.0
babel 2.17.0
beautifulsoup4 4.13.3
bitsandbytes 0.45.3
black 25.1.0
bleach 6.2.0
blinker 1.9.0
blis 1.2.0
boto3 1.37.9
botocore 1.37.9
braceexpand 0.1.7
bravado 11.1.0
bravado-core 6.1.1
cachetools 5.5.2
cairocffi 1.7.1
CairoSVG 2.7.1
catalogue 2.0.10
certifi 2025.1.31
cffi 1.17.1
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.4.1
click 8.1.8
cloudpathlib 0.21.0
comm 0.2.2
confection 0.1.5
contourpy 1.3.1
coverage 7.6.12
cssselect2 0.8.0
cycler 0.12.1
cymem 2.0.11
dash 2.18.2
dash-bootstrap-components 1.7.1
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
datasets 3.3.2
debugpy 1.8.14
decorator 5.2.1
defusedxml 0.7.1
Deprecated 1.2.18
dill 0.3.8
distlib 0.3.9
executing 2.2.0
fastjsonschema 2.21.1
fasttext-numpy2 0.10.4
filelock 3.17.0
Flask 3.0.3
flask-sock 0.7.0
fonttools 4.56.0
fqdn 1.5.1
frozenlist 1.5.0
fsspec 2024.12.0
ftfy 6.3.1
future 1.0.0
gitdb 4.0.12
GitPython 3.1.44
google-api-core 2.24.1
google-api-python-client 2.163.0
google-auth 2.38.0
google-auth-httplib2 0.2.0
googleapis-common-protos 1.69.1
greenlet 3.1.1
h11 0.14.0
h5py 3.13.0
httpcore 1.0.7
httplib2 0.22.0
httpx 0.28.1
huggingface-hub 0.29.2
identify 2.6.9
idna 3.10
importlib_metadata 8.6.1
importlib_resources 6.5.2
iniconfig 2.0.0
ipdb 0.13.13
ipykernel 6.29.5
ipython 9.0.2
ipython_pygments_lexers 1.1.1
ipywidgets 8.1.5
isoduration 20.11.0
isort 6.0.1
itables 2.2.5
itsdangerous 2.2.0
jedi 0.19.2
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
json5 0.10.0
jsonpointer 3.0.0
jsonref 1.1.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.12.0
jupyter-lsp 2.2.5
jupyter_server 2.15.0
jupyter_server_terminals 0.5.3
jupyterlab 4.3.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.13
kiwisolver 1.4.8
langcodes 3.5.0
language_data 1.3.0
lightning 2.5.0.post0
lightning-utilities 0.14.0
lmdb 1.6.2
loguru 0.7.3
lxml 5.3.1
marisa-trie 1.2.1
markdown-it-py 3.0.0
MarkupSafe 3.0.2
matplotlib 3.10.1
matplotlib-inline 0.1.7
maturin 1.8.2
mccabe 0.7.0
mdurl 0.1.2
memory-tempfile 2.2.3
mistune 3.1.2
monotonic 1.6
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.29.1
natsort 8.4.0
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
neptune 1.13.0
nest-asyncio 1.6.0
networkx 3.4.2
nibabel 5.3.2
nilearn 0.11.1
nltk 3.9.1
nodeenv 1.9.1
notebook 7.3.2
notebook_shim 0.2.4
numpy 2.2.3
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-ml-py 12.570.86
nvidia-ml-py3 7.352.0
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.2
omegaconf 2.3.0
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pandocfilters 1.5.1
parso 0.8.4
pathspec 0.12.1
peft 0.14.0
pexpect 4.9.0
pillow 11.1.0
pillow-avif-plugin 1.5.0
pip 25.0.1
platformdirs 4.3.6
plotly 6.0.0
pluggy 1.5.0
pre_commit 4.1.0
preshed 3.0.9
prometheus_client 0.21.1
prompt_toolkit 3.0.50
propcache 0.3.0
proto-plus 1.26.0
protobuf 5.29.3
psutil 7.0.0
psycopg2-binary 2.9.10
ptyprocess 0.7.0
pure_eval 0.2.3
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybind11 2.13.6
pycocoevalcap 1.2
pycocotools 2.0.8
pycparser 2.22
pydantic 2.10.6
pydantic_core 2.27.2
pydub 0.25.1
Pygments 2.19.1
pyinstrument 5.0.1
PyJWT 2.10.1
pylint 3.3.5
pyparsing 3.2.1
pytest 8.3.5
pytest-cov 6.0.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-json-logger 3.3.0
pytorch-lightning 2.5.0.post0
PyTurboJPEG 1.7.7
pytz 2025.1
PyYAML 6.0.2
pyzmq 26.2.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
retrying 1.3.4
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
s3transfer 0.11.4
safetensors 0.5.3
scikit-learn 1.6.1
scipy 1.15.2
seaborn 0.13.2
Send2Trash 1.8.3
sentencepiece 0.2.0
setuptools 75.8.2
shellingham 1.5.4
simple-websocket 1.1.0
simplejson 3.20.1
six 1.17.0
smart-open 7.1.0
smmap 5.0.2
sniffio 1.3.1
soupsieve 2.6
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.38
srsly 2.5.1
stack-data 0.6.3
swagger-spec-validator 3.0.4
sympy 1.13.1
tabulate 0.9.0
termcolor 2.5.0
terminado 0.18.1
thinc 8.3.4
threadpoolctl 3.5.0
timm 1.0.15
tinycss2 1.4.0
tokenize_rt 6.1.0
tokenizers 0.21.0
tomlkit 0.13.2
torch 2.6.0
torchaudio 2.6.0
torchmetrics 1.6.2
torchvision 0.21.0
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.49.0
treelib 1.7.1
triton 3.2.0
typer 0.15.2
types-python-dateutil 2.9.0.20241206
typing_extensions 4.12.2
tzdata 2025.1
uri-template 1.3.0
uritemplate 4.1.1
urllib3 2.3.0
virtualenv 20.29.3
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webcolors 24.11.1
webdataset 0.2.111
webencodings 0.5.1
websocket-client 1.8.0
Werkzeug 3.0.6
wheel 0.45.1
widgetsnbextension 4.0.13
wrapt 1.17.2
wsproto 1.2.0
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0
zstandard 0.23.0
The operating system you're using:
Ubuntu 22.04
The output of python --version:
Python 3.12.9
Additional context
Training on a SLURM cluster.
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
No response
Environment
No response
More info
The neptune-client team asked me to move this issue to pytorch-lightning.