I installed DeePMD-kit v1.3.3 with the offline installer "deepmd-kit-1.3.3-cuda10.0_gpu-Linux-x86_64.sh".
To train a model, I ran "dp train" on a GPU machine. The job started normally and loaded the corresponding CUDA and TensorFlow libraries, but it then crashed with the error below:
Error:
E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(29440, 25), b.shape=(25, 50), m=29440, n=50, k=25
[[node filter_type_0/MatMul_1 (defined at /lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
[[l2_force_test/_59]]
(1) Internal: Blas GEMM launch failed : a.shape=(29440, 25), b.shape=(25, 50), m=29440, n=50, k=25
[[node filter_type_0/MatMul_1 (defined at /lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
0 successful operations.
0 derived errors ignored.
So I tried the next releases and got the same error in a couple of them. My last attempt was the latest release, "deepmd-kit-2.0.0.b0-cuda10.1_gpu-Linux-x86_64.sh". This time, training the same model as before did not produce the cuBLAS error. Instead, it now complains that a few parameters in the input file need to be changed.
Error:
2021-05-27 03:06:08.659353: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-05-27 03:06:08.672916: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2245895000 Hz
cuda assert: DeePMD-kit: illegal nbor list sorting /home/conda/feedstock_root/build_artifacts/libdeepmd_1621486666421/work/source/lib/src/cuda/prod_env_mat.cu 509
The same input works fine on CPUs.
Could you please advise on this error? Is there a mismatch between the DeePMD-kit and TensorFlow versions?
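To make the version question concrete, here is a minimal sketch that pulls the DeePMD-kit version and the CUDA toolkit version out of an installer file name, so they can be compared against what the GPU driver on the machine supports. The naming pattern is an assumption inferred only from the two installer names quoted in this post, and `parse_installer_name` is a hypothetical helper, not part of DeePMD-kit.

```python
import re

# Assumed pattern, inferred from the two installer names in this post:
#   deepmd-kit-<deepmd version>-cuda<major.minor>_gpu-<platform>.sh
INSTALLER_RE = re.compile(
    r"^deepmd-kit-(?P<deepmd>.+?)-cuda(?P<cuda>\d+\.\d+)_gpu-(?P<platform>.+)\.sh$"
)


def parse_installer_name(name):
    """Return the versions encoded in a GPU installer file name (hypothetical helper)."""
    match = INSTALLER_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised installer name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    for installer in (
        "deepmd-kit-1.3.3-cuda10.0_gpu-Linux-x86_64.sh",
        "deepmd-kit-2.0.0.b0-cuda10.1_gpu-Linux-x86_64.sh",
    ):
        info = parse_installer_name(installer)
        print(f"DeePMD-kit {info['deepmd']} was built against CUDA {info['cuda']}")
```

The CUDA version printed for each installer is the one to check against the CUDA version that the installed NVIDIA driver reports.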
Thanks in advance.