The error is raised by Horovod, so check whether Horovod is linked correctly and works on its own.
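One way to run that check is to ask `ldd` which shared libraries Horovod's compiled TensorFlow op cannot resolve. The following is a minimal sketch, not an official Horovod or DeePMD-kit tool: the helper names `missing_shared_libs` and `find_horovod_ops` are made up for illustration, and the sketch assumes a Linux system with `ldd` available and that the op library sits in `horovod/tensorflow/` with a name starting with `mpi_lib`, as the traceback suggests.

```python
import importlib.util
import pathlib
import re
import subprocess


def missing_shared_libs(ldd_output: str) -> list[str]:
    """Return the library names that ldd reports as 'not found'."""
    pattern = re.compile(r"^\s*(\S+)\s+=>\s+not found", re.MULTILINE)
    return pattern.findall(ldd_output)


def find_horovod_ops() -> list[pathlib.Path]:
    """Locate Horovod's compiled TensorFlow op without importing it.

    Importing horovod.tensorflow would fail with the same NotFoundError,
    so only the package location is looked up here.
    """
    spec = importlib.util.find_spec("horovod")
    if spec is None or not spec.submodule_search_locations:
        return []  # Horovod is not installed in this environment
    root = pathlib.Path(list(spec.submodule_search_locations)[0])
    # Assumption: the op library is named mpi_lib<ext-suffix>, ending in .so
    return sorted((root / "tensorflow").glob("mpi_lib*.so"))


if __name__ == "__main__":
    for lib in find_horovod_ops():
        result = subprocess.run(["ldd", str(lib)], capture_output=True, text=True)
        missing = missing_shared_libs(result.stdout)
        print(lib, "->", missing or "all dependencies resolved")
```

If `libmpi.so.12` shows up in the list, Horovod was built against an MPI library that the dynamic loader cannot find in the current environment.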
The following error occurs when I run the `dp train input.json` command:
WARNING:tensorflow:From /home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)
Traceback (most recent call last):
File "/home/deepmd/anaconda3/envs/deepmd/bin/dp", line 10, in <module>
sys.exit(main())
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 562, in main
train_dp(**dict_args)
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 74, in train
run_opt = RunOptions(
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/run_options.py", line 97, in __init__
self.try_init_distrib()
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/run_options.py", line 180, in try_init_distrib
import horovod.tensorflow as HVD
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 26, in <module>
from horovod.tensorflow import elastic
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/elastic.py", line 24, in <module>
from horovod.tensorflow.functions import broadcast_object, broadcast_object_fn, broadcast_variables
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 24, in <module>
from horovod.tensorflow.mpi_ops import allgather, broadcast, broadcast
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 53, in <module>
raise e
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 50, in <module>
MPI_LIB = load_library('mpi_lib' + get_ext_suffix())
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 45, in load_library
library = load_library.load_op_library(filename)
File "/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libmpi.so.12: cannot open shared object file: No such file or directory
**To work around this, I linked libtensorflow_framework.so.2 in /home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow to libmpi.so.12, and added `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/libmpi.so.12"` to my .bashrc file. The following error is then reported:**
WARNING:tensorflow:From /home/deepmd/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/home/deepmd/anaconda3/envs/deepmd/lib/python3.10/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1538)..............:
MPIDI_OFI_mpi_init_hook(1473):
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1538)..............:
MPIDI_OFI_mpi_init_hook(1473):
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
Segmentation fault (core dumped)
Can you give me some ideas on how to solve this problem? I would be very grateful.
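One detail of the attempted fix may be worth checking: the dynamic loader treats `LD_LIBRARY_PATH` as a colon-separated list of directories, so an entry that names a file (such as a path ending in `libmpi.so.12`) is not searched. The sketch below lists entries that are not existing directories. The helper name `non_directory_entries` is hypothetical, and the check says nothing about whether the library found is the MPI build that Horovod was compiled against.

```python
import os


def non_directory_entries(ld_library_path: str) -> list[str]:
    """Return entries of a colon-separated search path that are not existing directories."""
    # Empty entries are skipped here; the loader interprets them as the current directory.
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    for entry in non_directory_entries(os.environ.get("LD_LIBRARY_PATH", "")):
        print("not a directory:", entry)
```

Any path it prints would need to be replaced by the directory that contains the library rather than the library file itself.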