Error at the first step when running MD simulation using i-PI with DeePMD model #2655

plumbum082 · 2023-07-05T03:59:53Z

Hello developers,
I have a problem when running MD simulations using i-PI with my DeePMD model: the run fails with an error at the first step. However, a single-point calculation using the same DeepPot code works without any problem. I have confirmed that the shape of the data received by DPDriver.grad is correct, and when I feed the same data into a single-point calculation, the result is also correct, so I cannot understand why the MD run fails. I have also checked with 'nvidia-smi' that my driver is compatible with cudatoolkit 11.3.1. My client_DP.py and the error output are below. Could you help me with this? Thank you very much!
This is my client_DP.py:
import os
import sys
import driver
import numpy as np
from deepmd.infer import DeepPot

class DPDriver(driver.BaseDriver):
    def __init__(self, addr, port, socktype):
        driver.BaseDriver.__init__(self, port, addr, socktype)
        return

    def grad(self, crd, cell): # receive SI input, return SI values
        crd = np.array(crd*1e10) # convert to angstrom
        cell = np.array(cell*1e10)
        dp = DeepPot("graph02.pb")
        coord = crd.reshape([1, -1])
        box = cell.reshape([1, -1])
        atype = np.ones((crd.shape[0],), dtype=np.int32)
        atype[0::3] = 0
        energy, grad, virial = dp.eval(coord, box, atype)

        energy = energy[0][0] /27.211386245988*2625.5
        grad = grad[0] /27.211386245988*2625.5
        virial = virial[0] /27.211386245988*2625.5
        print(energy,grad,virial)

        # convert to SI
        energy = np.array(energy * 1000 / 6.0221409e+23) # kj/mol to Joules
        grad = np.array(-grad * 1000 / 6.0221409e+23 * 1e10) # convert kj/mol/A to joule/m
        virial = np.array(virial * 1000 / 6.0221409e+23) # kj/mol to Joules
        #print(grad)
        return energy, grad, virial

if __name__ == '__main__':
    addr = sys.argv[1]
    port = int(sys.argv[2])
    socktype = sys.argv[3]
    driver_dp = DPDriver(addr, port, socktype)
    while True:
        driver_dp.parse()
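
As a quick sanity check on the unit handling in grad above, the chained energy conversion (eV to Hartree to kJ/mol to J) can be compared against the direct eV-to-joule factor. This sketch is not part of the original script. It assumes the model returns energies in eV and uses the exact SI value of the electron volt; the function names are hypothetical.

```python
import math

# Constants copied from client_DP.py
EV_PER_HARTREE = 27.211386245988
KJ_PER_MOL_PER_HARTREE = 2625.5
AVOGADRO = 6.0221409e+23

# Exact SI definition of the electron volt, in joules
EV_IN_JOULE = 1.602176634e-19


def ev_to_joule_chained(energy_ev):
    """eV -> Hartree -> kJ/mol -> J, following the steps used in grad()."""
    kj_per_mol = energy_ev / EV_PER_HARTREE * KJ_PER_MOL_PER_HARTREE
    return kj_per_mol * 1000 / AVOGADRO


def ev_to_joule_direct(energy_ev):
    """Single-step conversion using the SI definition of the eV."""
    return energy_ev * EV_IN_JOULE


if __name__ == "__main__":
    e = -467.3  # arbitrary example energy in eV (not taken from the thread)
    chained = ev_to_joule_chained(e)
    direct = ev_to_joule_direct(e)
    print(chained, direct, math.isclose(chained, direct, rel_tol=1e-5))
```

If the two routes agree within the tolerance, the energy scaling is at least internally consistent. This check says nothing about the CUDA assert itself.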

The error output is:

2023-07-04 12:01:51.234067: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:15 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
2023-07-04 12:01:51.234106: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
2023-07-04 12:01:51.234125: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
[[load/o_virial/_27]]
Traceback (most recent call last):
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
[[load/o_virial/_27]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/input_lbg-12765-7738088/client_DP.py", line 58, in <module>
driver_dp.parse()
File "/home/input_lbg-12765-7738088/driver.py", line 195, in parse
self.posdata()
File "/home/input_lbg-12765-7738088/driver.py", line 143, in posdata
energy, grad, virial = self.grad(self.crd, self.cell)
File "/home/input_lbg-12765-7738088/client_DP.py", line 34, in grad
energy, grad, virial = dp.eval(coord, box, atype)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 373, in eval
output = self._eval_func(self._eval_inner, numb_test, natoms)(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 288, in eval_func
return self.auto_batch_size.execute_all(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 191, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 103, in execute
n_batch, result = callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 169, in execute_with_batch_size
return (end_index - start_index), callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 526, in _eval_inner
v_out = run_sess(self.sess, t_out, feed_dict=feed_dict_test)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/sess.py", line 30, in run_sess
return sess.run(*args, **kwargs)
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
[[load/o_virial/_27]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /project/source/op/custom_op.cc:17
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'load/ProdEnvMatA':

2023-07-04 12:02:06.641526: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /opt/mamba/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/opt/mamba/lib/python3.10/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)

njzjz (Maintainer) · 2023-07-05T20:39:33Z

In your code, you initialize DeepPot every time grad is called. It is recommended to initialize it only once, since loading the model is expensive. However, this should not be related to the error. Could you provide a sample to reproduce the error?
