Replies: 1 comment
-
This is a bug in the CUDA toolkit. Since CUDA is not open source, the only thing you can do is upgrade CUDA to the latest version and recompile TensorFlow. See #1062 for more details.
-
I have built a model for the 2D material bilayer PdSe2, and I ran DFT-MD at 300 K and 1000 K to generate the training set (two groups, one with 72 atoms and the other with 192 atoms). Everything goes well, but when I set -n 500 for dp test on the 1000 K, 192-atom group, the error below occurs. The error does not occur when I test models from the other groups on the same test set, and it also does not occur when -n is set to 100. The detailed output is attached below.
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : /home/ubuntu/PdSe2VASP/bs/5ps/300k/192/deepmd/Pd64Se128/testdata
DEEPMD INFO Adjust batch size from 1024 to 2048
DEEPMD INFO Adjust batch size from 2048 to 4096
DEEPMD INFO Adjust batch size from 4096 to 8192
DEEPMD INFO Adjust batch size from 8192 to 16384
DEEPMD INFO Adjust batch size from 16384 to 32768
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1380, in _do_call
return fn(*args)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1363, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1456, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Blas xGEMV launch failed : a.shape=[1,1131520,25], b.shape=[1,1,25], m=1131520, n=1, k=25
[[{{node load/gradients/filter_type_1/MatMul_4_grad/MatMul}}]]
[[load/o_force/_25]]
(1) INTERNAL: Blas xGEMV launch failed : a.shape=[1,1131520,25], b.shape=[1,1,25], m=1131520, n=1, k=25
[[{{node load/gradients/filter_type_1/MatMul_4_grad/MatMul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 479, in main
test(**dict_args)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/entrypoints/test.py", line 82, in test
err = test_ener(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/entrypoints/test.py", line 229, in test_ener
ret = dp.eval(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_pot.py", line 266, in eval
output = self._eval_func(self._eval_inner, numb_test, natoms)(coords, cells, atom_types, fparam = fparam, aparam = aparam, atomic = atomic, efield = efield)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_pot.py", line 199, in eval_func
return self.auto_batch_size.execute_all(inner_func, numb_test, natoms, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/utils/batch_size.py", line 116, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/utils/batch_size.py", line 66, in execute
n_batch, result = callable(max(self.current_batch_size // natoms, 1), start_index)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/utils/batch_size.py", line 108, in execute_with_batch_size
return (end_index - start_index), callable(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_pot.py", line 380, in _eval_inner
v_out = run_sess(self.sess, t_out, feed_dict = feed_dict_test)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/utils/sess.py", line 21, in run_sess
return sess.run(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 970, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1193, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1373, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1399, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Blas xGEMV launch failed : a.shape=[1,1131520,25], b.shape=[1,1,25], m=1131520, n=1, k=25
[[node load/gradients/filter_type_1/MatMul_4_grad/MatMul
(defined at /home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_eval.py:169)
]]
[[load/o_force/_25]]
(1) INTERNAL: Blas xGEMV launch failed : a.shape=[1,1131520,25], b.shape=[1,1,25], m=1131520, n=1, k=25
[[node load/gradients/filter_type_1/MatMul_4_grad/MatMul
(defined at /home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_eval.py:169)
]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node load/gradients/filter_type_1/MatMul_4_grad/MatMul:
In[0] load/gradients/filter_type_1/Tanh_3_grad/TanhGrad:
In[1] load/filter_type_1/matrix_1_1/read:
Operation defined at: (most recent call last)
Input Source operations connected to node load/gradients/filter_type_1/MatMul_4_grad/MatMul:
In[0] load/gradients/filter_type_1/Tanh_3_grad/TanhGrad:
In[1] load/filter_type_1/matrix_1_1/read:
Operation defined at: (most recent call last)
Original stack trace for 'load/gradients/filter_type_1/MatMul_4_grad/MatMul':
File "/home/ubuntu/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 479, in main
test(**dict_args)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/entrypoints/test.py", line 71, in test
dp = DeepPotential(model)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/__init__.py", line 62, in DeepPotential
dp = DeepPot(mf, load_prefix=load_prefix, default_tf_graph=default_tf_graph)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_pot.py", line 88, in __init__
DeepEval.__init__(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_eval.py", line 41, in __init__
self.graph = self._load_graph(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/deepmd/infer/deep_eval.py", line 169, in _load_graph
tf.import_graph_def(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py", line 552, in new_func
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/importer.py", line 407, in import_graph_def
return _import_graph_def_internal(
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/importer.py", line 520, in _import_graph_def_internal
_ProcessNewOps(graph)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/importer.py", line 251, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3847, in _add_new_tf_operations
new_ops = [
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3848, in <listcomp>
self._create_op_from_tf_operation(c_op, compute_device=compute_devices)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3730, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/home/ubuntu/miniconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2101, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
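For anyone reading the log later, here is a rough sanity check of the numbers in it. It only reuses values printed above: the batch sizes from the "Adjust batch size" lines, the 192 atoms per frame, the expression `max(self.current_batch_size // natoms, 1)` from the traceback, and `m=1131520` from the xGEMV error. The function name `frames_per_batch` is made up for this sketch, and the idea that `m` grows in proportion to the number of frames per call is an assumption, not something confirmed in this thread.

```python
# Back-of-the-envelope check of the numbers in the log above.
# Assumption: the GEMV row count m scales with the number of frames per call.

NATOMS = 192            # atoms per frame in the failing system
FAILED_ROWS = 1131520   # m reported in "Blas xGEMV launch failed"


def frames_per_batch(batch_size: int, natoms: int) -> int:
    """Hypothetical helper: same expression as shown in the traceback."""
    return max(batch_size // natoms, 1)


# Batch sizes from the log: start at 1024, doubled five times up to 32768.
batch_sizes = [1024 * 2**i for i in range(6)]
frames = {b: frames_per_batch(b, NATOMS) for b in batch_sizes}

# Frames per call at the largest batch size, where the failure was reported.
last_frames = frames[batch_sizes[-1]]
rows_per_frame, remainder = divmod(FAILED_ROWS, last_frames)

for b in batch_sizes:
    print(f"batch size {b:>5} -> {frames[b]:>3} frames per call")
print(f"rows per frame at failure: {rows_per_frame} (remainder {remainder})")
print(f"rows for a 100-frame call: {100 * rows_per_frame}")
```

The division comes out exact, which is consistent with the failing call holding 170 frames. If that reading is right, -n 100 would cap a call at 100 frames and a correspondingly smaller matrix, which may be why it passes, but that is an inference from the log only.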