Hello, I'm working in a conda environment, trying to reproduce the training results.
I installed the necessary packages, and the code runs. TensorFlow also detects my GPU (NVIDIA GeForce RTX 3090).
However, training takes a very long time to start, and I keep getting nan for both loss and val_loss.
Here is the output I get when training DOES proceed:
dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 13:49:04.065612: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 13:49:04.279901: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2024-01-25 13:49:57.467798: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
196/1000 [====>.........................] - ETA: 41:01 - loss: nan
/home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/scipy/ndimage/interpolation.py:605: UserWarning: From scipy 0.13.0, the output shape of zoom() is calculated with round() instead of int() - for these inputs the size of the returned array has changed.
"the returned array has changed.", UserWarning)
1000/1000 [==============================] - 767s 767ms/step - loss: nan - val_loss: nan
WARNING:tensorflow:From /home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/keras/callbacks/tensorboard_v1.py:343: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
Epoch 2/100
1000/1000 [==============================] - 184s 184ms/step - loss: nan - val_loss: nan
Epoch 3/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 4/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 5/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 6/100
1000/1000 [==============================] - 186s 186ms/step - loss: nan - val_loss: nan
Epoch 7/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 8/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 9/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 10/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 11/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 12/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 13/100
540/1000 [===============>..............] - ETA: 1:24 - loss: nan
When training DOES NOT proceed, this is the error I get:
dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 15:31:03.464801: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 15:31:03.658949: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2024-01-25 15:31:46.257837: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2024-01-25 15:39:51.069436: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2024-01-25 15:39:51.069476: I tensorflow/stream_executor/stream.cc:4838] [stream=0xe336d00,impl=0xe4f0b40] did not memzero GPU location; source: 0x7f53567fad20
2024-01-25 15:39:51.069481: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7f53567fad30
2024-01-25 15:39:51.069484: I tensorflow/stream_executor/stream.cc:1839] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'start timer': 0x7f53567fad30
2024-01-25 15:39:51.069493: I tensorflow/stream_executor/stream.cc:1851] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'stop timer': 0x7f53567fad30
2024-01-25 15:39:51.069496: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr
Aborted (core dumped)
These are the packages in my conda environment:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
_tflow_select 2.1.0 gpu
absl-py 0.15.0 pyhd3eb1b0_0
astor 0.8.1 py37h06a4308_0
blas 1.1 openblas conda-forge
blosc 1.21.3 h6a678d5_0
bottleneck 1.3.5 py37h7deecbd_0
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
brunsli 0.1 h2531618_0
bzip2 1.0.8 h7b6447c_0
c-ares 1.19.1 h5eee18b_0
ca-certificates 2023.12.12 h06a4308_0
cairo 1.16.0 hb05425b_5
certifi 2022.12.7 py37h06a4308_0
cfitsio 3.470 h5893167_7
charls 2.2.0 h2531618_0
cloudpickle 2.0.0 pyhd3eb1b0_0
cudatoolkit 10.0.130 0
cudnn 7.6.5 cuda10.0_0
cupti 10.0.130 0
cycler 0.11.0 pyhd3eb1b0_0
cython 0.29.33 py37h6a678d5_0
cytoolz 0.12.0 py37h5eee18b_0
dask-core 2021.10.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0
expat 2.5.0 h6a678d5_0
ffmpeg 4.3.2 h37c90e5_3 conda-forge
fftw 3.3.9 h27cfd23_1
flit-core 3.6.0 pyhd3eb1b0_0
fontconfig 2.14.2 h14ed4e7_0 conda-forge
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
fsspec 2022.11.0 py37h06a4308_0
gast 0.2.2 py37_0
gettext 0.21.0 hf68c758_0
giflib 5.2.1 h5eee18b_3
glib 2.70.2 h780b84a_4 conda-forge
glib-tools 2.70.2 h780b84a_4 conda-forge
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
google-pasta 0.2.0 pyhd3eb1b0_0
graphite2 1.3.14 h295c915_1
grpcio 1.42.0 py37hce63b2e_0
gst-plugins-base 1.14.5 h0935bb2_2 conda-forge
gstreamer 1.18.5 ha1a6a79_0
h5py 2.10.0 py37hd6299e0_1
harfbuzz 2.9.1 h83ec7ef_1 conda-forge
hdf5 1.10.6 h3ffc7dd_1
icu 68.1 h2531618_0
imagecodecs 2021.8.26 py37hf0132c2_1
imageio 2.19.3 py37h06a4308_0
importlib-metadata 4.11.3 py37h06a4308_0
jasper 1.900.1 hd497a04_4
joblib 1.1.1 py37h06a4308_0
jpeg 9e h5eee18b_1
jxrlib 1.1 h7b6447c_2
keras 2.3.1 0
keras-applications 1.0.8 py_1
keras-base 2.3.1 py37_0
keras-preprocessing 1.1.2 pyhd3eb1b0_0
kiwisolver 1.4.4 py37h6a678d5_0
krb5 1.20.1 h568e23c_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libaec 1.0.4 he6710b0_1
libblas 3.9.0 16_linux64_openblas conda-forge
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libcblas 3.9.0 16_linux64_openblas conda-forge
libclang 11.1.0 default_ha53f305_1 conda-forge
libcurl 8.2.1 h91b91d3_0
libdeflate 1.8 h7f8727e_5
libedit 3.1.20230828 h5eee18b_0
libev 4.33 h7f8727e_1
libevent 2.1.10 h9b69904_4 conda-forge
libffi 3.4.4 h6a678d5_0
libgcc-ng 13.2.0 h807b86a_3 conda-forge
libgfortran 3.0.0 1 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libglib 2.70.2 h174f98d_4 conda-forge
libgomp 13.2.0 h807b86a_3 conda-forge
libiconv 1.16 h7f8727e_2
libidn2 2.3.4 h5eee18b_0
liblapack 3.9.0 16_linux64_openblas conda-forge
liblapacke 3.9.0 16_linux64_openblas conda-forge
libllvm11 11.1.0 h9e868ea_6
libnghttp2 1.52.0 ha637b67_1
libnsl 2.0.0 h5eee18b_0
libopenblas 0.3.21 h043d6bf_0
libopencv 4.5.3 py37h25009ff_1 conda-forge
libpng 1.6.39 h5eee18b_0
libpq 12.15 h37d81fd_1
libprotobuf 3.16.0 h780b84a_0 conda-forge
libssh2 1.10.0 h37d81fd_2
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libtiff 4.4.0 hecacb30_2
libunistring 0.9.10 h27cfd23_0
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
libxcb 1.15 h7f8727e_0
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.9.12 h72842e0_0 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
libzopfli 1.0.3 he6710b0_0
locket 1.0.0 py37h06a4308_0
lz4-c 1.9.4 h6a678d5_0
markdown 3.4.1 py37h06a4308_0
markupsafe 2.1.1 py37h7f8727e_0
matplotlib-base 3.5.3 py37hf590b9c_0
munkres 1.1.4 py_0
mysql-common 8.0.29 haf5c9bc_1 conda-forge
mysql-libs 8.0.29 h28c427c_1 conda-forge
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
networkx 2.6.3 pyhd3eb1b0_0
nspr 4.35 h6a678d5_0
nss 3.89.1 h6a678d5_0
numexpr 2.8.4 py37hd2a5715_0
numpy 1.21.5 py37hf838250_3
numpy-base 1.21.5 py37h1e6e340_3
openblas 0.3.3 ha44fe06_1 conda-forge
opencv 4.5.3 py37h89c1867_1 conda-forge
openh264 2.1.1 h4ff587b_0
openjpeg 2.4.0 h3ad879b_0
openssl 1.1.1w h7f8727e_0
opt_einsum 3.3.0 pyhd3eb1b0_1
packaging 22.0 py37h06a4308_0
pandas 1.3.5 py37h8c16a72_0
partd 1.2.0 pyhd3eb1b0_1
pcre 8.45 h295c915_0
pillow 9.4.0 py37h6a678d5_0
pip 22.3.1 py37h06a4308_0
pixman 0.40.0 h7f8727e_1
protobuf 3.16.0 py37hcd2ae1e_0 conda-forge
py-opencv 4.5.3 py37h6531663_1 conda-forge
pycocotools 2.0.4 py37hda87dfa_2 conda-forge
pyparsing 3.0.9 py37h06a4308_0
python 3.7.16 h7a1cb2a_0
python-dateutil 2.8.2 pyhd3eb1b0_0
python_abi 3.7 2_cp37m conda-forge
pytz 2022.7 py37h06a4308_0
pywavelets 1.3.0 py37h7f8727e_0
pyyaml 6.0 py37h5eee18b_1
qt 5.12.9 h9d6b050_2 conda-forge
readline 8.2 h5eee18b_0
scikit-image 0.18.3 py37h51133e4_0
scikit-learn 1.0.2 py37h51133e4_1
scipy 1.2.0 py37_blas_openblashb06ca3d_200 conda-forge
setuptools 65.6.3 py37h06a4308_0
six 1.16.0 pyhd3eb1b0_1
snappy 1.1.10 h6a678d5_1
sqlite 3.41.2 h5eee18b_0
tensorboard 1.14.0 py37hf484d3e_0
tensorflow 1.14.0 gpu_py37h4491b45_0
tensorflow-base 1.14.0 gpu_py37h8d69cac_0
tensorflow-estimator 1.14.0 py_0
tensorflow-gpu 1.14.0 h0d30ee6_0
termcolor 1.1.0 py37h06a4308_1
threadpoolctl 2.2.0 pyh0d69192_0
tifffile 2021.7.2 pyhd3eb1b0_2
tk 8.6.12 h1ccaba5_0
toolz 0.12.0 py37h06a4308_0
typing_extensions 4.4.0 py37h06a4308_0
webencodings 0.5.1 py37_1
werkzeug 0.16.1 py_0
wheel 0.38.4 py37h06a4308_0
wrapt 1.14.1 py37h5eee18b_0
x264 1!161.3030 h7f98852_1 conda-forge
xz 5.4.5 h5eee18b_0
yaml 0.2.5 h7b6447c_0
zfp 0.5.5 h295c915_6
zipp 3.11.0 py37h06a4308_0
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hc292b87_0
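Since the list is long, the GPU-related entries may be easier to read when pulled out on their own. The following is only a sketch: the function name `gpu_stack_versions` and the `GPU_STACK` selection are hypothetical choices of mine, and it assumes the usual whitespace-separated `conda list` columns (name, version, build, channel).

```python
# Hypothetical selection of packages that make up the GPU stack.
GPU_STACK = (
    "cudatoolkit",
    "cudnn",
    "cupti",
    "tensorflow",
    "tensorflow-base",
    "tensorflow-gpu",
    "keras",
)


def gpu_stack_versions(conda_list_text, names=GPU_STACK):
    """Return {package: version} for the selected packages in `conda list` text."""
    versions = {}
    for line in conda_list_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Name Version Build Channel" header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # The first column is the package name, the second its version.
        if len(fields) >= 2 and fields[0] in names:
            versions[fields[0]] = fields[1]
    return versions


if __name__ == "__main__":
    # A few rows copied from the environment listing above.
    sample = """\
# Name Version Build Channel
cudatoolkit 10.0.130 0
cudnn 7.6.5 cuda10.0_0
keras 2.3.1 0
numpy 1.21.5 py37hf838250_3
tensorflow-gpu 1.14.0 h0d30ee6_0
"""
    for name, version in gpu_stack_versions(sample).items():
        print(name, version)
```

The full listing could be passed in the same way, for example by saving the output of `conda list` to a text file and reading it into a string first.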
What could I have done wrong?