Reply: We'll track the issue in #4167.
Dear Developers,
I am using DeePMD-kit v3.0.0b3 for pretraining and fine-tuning with DPA-2. The software was installed offline with a CUDA 11.8 build, which matches my system's CUDA version. Both pretraining and fine-tuning complete successfully, and molecular dynamics simulations with ASE run without issues. However, when I run LAMMPS with the frozen model file (model.pth) using the command lmp -in in.lmp, I get an error; the same error also occurs with a version I compiled myself. Could you please help me resolve this issue? Thank you for your assistance.
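For context, a minimal LAMMPS input for a frozen DeePMD PyTorch model typically looks like the sketch below; the data file name and atom style are illustrative and may differ from the actual in.lmp shared via the link.

```
# Minimal DeePMD-kit LAMMPS input sketch (file names are placeholders)
units        metal
boundary     p p p
atom_style   atomic          # the actual run may use a different atom style
read_data    system.data     # placeholder for the user's data file

pair_style   deepmd model.pth   # frozen PyTorch-backend model
pair_coeff   * *

timestep     0.0001
run          100
```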
Here are all the relevant files:
link: https://pan.baidu.com/s/1dfFwZhzANTwI70Pf7We5Hg
extract code: a88t
The error output of lmp -in in.lmp:
WARNING: There was an error initializing an OpenFabrics device.
Local host: xc06n08
Local device: mlx5_0
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.12
2024-09-02 01:44:51.820682: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-02 01:44:51.820809: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-02 01:44:51.821687: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /share/home/yjli/apps/dp-v300b3-cuda124/lib/deepmd_lmp
Reading data file ...
orthogonal box = (0 0 0) to (50 50 50)
1 by 1 by 1 MPI processor grid
reading atoms ...
6 atoms
Finding 1-2 1-3 1-4 neighbors ...
special bond factors lj: 0 0 0
special bond factors coul: 0 0 0
0 = max # of 1-2 neighbors
0 = max # of 1-3 neighbors
0 = max # of 1-4 neighbors
1 = max # of special neighbors
special bonds CPU = 0.000 seconds
read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
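The three warnings above only concern performance tuning, not the crash itself; they can be addressed by exporting the variables before launching LAMMPS. The values below are illustrative, assuming a few CPU cores per MPI task, and should be tuned for the actual node:

```shell
# Illustrative thread settings; tune per node (see https://deepmd.rtfd.io/parallelism/)
export DP_INTRA_OP_PARALLELISM_THREADS=2   # threads used within a single op
export DP_INTER_OP_PARALLELISM_THREADS=1   # ops executed concurrently
export OMP_NUM_THREADS=2                   # OpenMP threads per MPI task
```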
Summary of lammps deepmd module ...
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 2 steps, delay = 10 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 11
ghost atom cutoff = 11
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/nsq
stencil: none
bin: none
WARNING: Proc sub-domain size < neighbor skin, could lead to lost atoms (src/domain.cpp:966)
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0001
WARNING: Communication cutoff adjusted to 11 (src/comm.cpp:732)
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/transform_output.py", line 156, in forward_lower
vvi = split_vv1[_44]
svvi = split_svv1[_44]
_45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )