I can reproduce the error.
Dear deepmd-kit developers,
I installed deepmd-kit 2.0.0-gpu-cuda-11.3 and tested it with examples/water/se_e3. Training fails with the error "failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE".
deepmd-kit runs well with examples/water/se_e2_a and with my own jobs. How can I train models with the se_e3 three-body descriptor?
Best regards!
Error information:
2021-08-29 16:00:20.818358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:21:00.0 name: NVIDIA GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.77GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-08-29 16:00:20.818899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-29 16:00:20.818949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-29 16:00:20.818958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-08-29 16:00:20.818963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-08-29 16:00:20.819528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8110 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:21:00.0, compute capability: 8.6)
DEEPMD INFO initialize model from scratch
DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
2021-08-29 16:00:21.436975: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-29 16:00:21.983103: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-29 16:00:22.006086: E tensorflow/stream_executor/cuda/cuda_blas.cc:564] failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE
Traceback (most recent call last):
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMV launch failed : a.shape=[1,1228800,2], b.shape=[1,1,2], m=1228800, n=1, k=2
[[{{node gradients/filter_type_all/MatMul_6_grad/MatMul}}]]
(1) Internal: Blas xGEMV launch failed : a.shape=[1,1228800,2], b.shape=[1,1,2], m=1228800, n=1, k=2
[[{{node gradients/filter_type_all/MatMul_6_grad/MatMul}}]]
[[l2_force_test/_39]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/deepmd-kit/bin/dp", line 10, in <module>
sys.exit(main())
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
train_dp(**dict_args)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
_do_work(jdata, run_opt, is_compress)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 163, in _do_work
model.train(train_data, valid_data)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/train/trainer.py", line 506, in train
self.valid_on_the_fly(fp, [train_batch], valid_batches, print_header=True)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/train/trainer.py", line 600, in valid_on_the_fly
train_results = self.get_evaluation_results(train_batches)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/train/trainer.py", line 652, in get_evaluation_results
results = self.loss.eval(self.sess, feed_dict, natoms)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/loss/ener.py", line 140, in eval
error, error_e, error_f, error_v, error_ae, error_pf = run_sess(sess, run_data, feed_dict=feed_dict)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/deepmd/utils/sess.py", line 20, in run_sess
return sess.run(*args, **kwargs)
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/user/deepmd-kit/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMV launch failed : a.shape=[1,1228800,2], b.shape=[1,1,2], m=1228800, n=1, k=2
[[node gradients/filter_type_all/MatMul_6_grad/MatMul (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_t.py:352) ]]
(1) Internal: Blas xGEMV launch failed : a.shape=[1,1228800,2], b.shape=[1,1,2], m=1228800, n=1, k=2
[[node gradients/filter_type_all/MatMul_6_grad/MatMul (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_t.py:352) ]]
[[l2_force_test/_39]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node gradients/filter_type_all/MatMul_6_grad/MatMul:
filter_type_all/matrix_1_1_1/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
Input Source operations connected to node gradients/filter_type_all/MatMul_6_grad/MatMul:
filter_type_all/matrix_1_1_1/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
Original stack trace for 'gradients/filter_type_all/MatMul_6_grad/MatMul':
File "/bin/dp", line 10, in <module>
sys.exit(main())
File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
train_dp(**dict_args)
File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
_do_work(jdata, run_opt, is_compress)
File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 158, in _do_work
model.build(train_data, stop_batch)
File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 338, in build
self._build_network(data)
File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
= self.model.build (self.place_holders['coord'],
File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 229, in build
= self.descrpt.prod_force_virial (atom_ener, natoms)
File "/lib/python3.9/site-packages/deepmd/descriptor/se_t.py", line 352, in prod_force_virial
[net_deriv] = tf.gradients (atom_ener, self.descrpt_reshape)
File "/lib/python3.9/site-packages/tensorflow/python/ops/gradients_impl.py", line 169, in gradients
return gradients_util._GradientsHelper(
File "/lib/python3.9/site-packages/tensorflow/python/ops/gradients_util.py", line 681, in _GradientsHelper
in_grads = _MaybeCompile(grad_scope, op, func_call,
File "/lib/python3.9/site-packages/tensorflow/python/ops/gradients_util.py", line 338, in _MaybeCompile
return grad_fn() # Exit early
File "/lib/python3.9/site-packages/tensorflow/python/ops/gradients_util.py", line 682, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/lib/python3.9/site-packages/tensorflow/python/ops/math_grad.py", line 1733, in _MatMulGrad
grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
ret = Operation(
File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
...which was originally created as op 'filter_type_all/MatMul_6', defined at:
File "/bin/dp", line 10, in <module>
sys.exit(main())
[elided 4 identical lines from previous traceback]
File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
= self.model.build (self.place_holders['coord'],
File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 159, in build
= self.descrpt.build(coord,
File "/lib/python3.9/site-packages/deepmd/descriptor/se_t.py", line 315, in build
self.dout, self.qmat = self._pass_filter(self.descrpt_reshape,
File "/lib/python3.9/site-packages/deepmd/descriptor/se_t.py", line 387, in _pass_filter
layer, qmat = self._filter(tf.cast(inputs_i, self.filter_precision), type_i, name='filter_type_all'+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
File "/lib/python3.9/site-packages/deepmd/descriptor/se_t.py", line 498, in _filter
ebd_env_ij = embedding_net(ebd_env_ij,
File "/lib/python3.9/site-packages/deepmd/utils/network.py", line 176, in embedding_net
hidden = tf.reshape(activation_fn(tf.matmul(xx, w) + b), [-1, outputs_size[ii]])
File "/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/lib/python3.9/site-packages/tensorflow/python/ops/math_ops.py", line 3489, in matmul
return gen_math_ops.mat_mul(
File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
_, _, _op, _outputs = _op_def_library._apply_op_helper(
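For anyone comparing reports of this failure, the sketch below pulls the operand shapes out of the "Blas xGEMV launch failed" line so they can be checked against the traceback. It uses only the Python standard library. The names `LINE`, `PATTERN`, and `parse_gemv_failure` are hypothetical helpers written for this note and are not part of deepmd-kit or TensorFlow. Reading `b.shape` as (batch, n, k) is an interpretation based on the `transpose_b=True` call in `_MatMulGrad` shown in the log, not something the log states directly.

```python
import re

# Error line copied verbatim from the log above.
LINE = (
    "Internal: Blas xGEMV launch failed : a.shape=[1,1228800,2], "
    "b.shape=[1,1,2], m=1228800, n=1, k=2"
)

# Hypothetical pattern for this one message format; other TensorFlow
# versions may word the message differently.
PATTERN = re.compile(
    r"a\.shape=\[(?P<a>[\d,]+)\], b\.shape=\[(?P<b>[\d,]+)\], "
    r"m=(?P<m>\d+), n=(?P<n>\d+), k=(?P<k>\d+)"
)


def parse_gemv_failure(line):
    """Return the shapes and m, n, k reported in a Blas launch-failure line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised Blas launch-failure line")
    return {
        "a_shape": tuple(int(x) for x in match["a"].split(",")),
        "b_shape": tuple(int(x) for x in match["b"].split(",")),
        "m": int(match["m"]),
        "n": int(match["n"]),
        "k": int(match["k"]),
    }


info = parse_gemv_failure(LINE)

# Assumed layout: a is (batch, m, k); b is (batch, n, k) because the
# gradient op calls mat_mul(grad, b, transpose_b=True).
assert info["a_shape"][1:] == (info["m"], info["k"])
assert info["b_shape"][1:] == (info["n"], info["k"])

print(info)
```

With n = 1 the product has a single output column, which is presumably why the failure is reported for a GEMV (matrix-vector) routine rather than GEMM. The shapes are consistent with the backward pass of a `tf.matmul(xx, w)` in `embedding_net` where `xx` has one column and `w` has shape [1, 2].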