Bug summary
I built deepmd-kit 3.1.1 from source with the tensorflow-rocm backend, and dp train works well. However, lammps with the built-in deepmd-kit module fails: it detects the AMD GPU, then aborts with a TensorFlow internal error: Non-OK-status: RegisterAlreadyLocked(op_data_factory) status: ALREADY_EXISTS: Op with name Gelu.
DeePMD-kit Version
DeePMD-kit v3.1.1
Backend and its version
tensorflow-rocm 2.16.1
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/LiBr-aq-box-298K/npt_eq$ lmp -in in.npt_eq
2025-10-26 16:36:40.296306: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
LAMMPS (2 Aug 2023 - Update 1)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (0 0 0) to (33.97 33.97 33.97)
1 by 1 by 1 MPI processor grid
reading atoms ...
3650 atoms
read_data CPU = 0.004 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...
>>> Info of deepmd-kit:
installed to: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310
source: v3.1.1
source branch: HEAD
source commit: bfa62458
source commit at: 2025-10-01 01:40:47 +0800
support model ver.: 1.1
build variant: rocm
build with tf inc: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include;/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include
build with tf lib: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
>>> Info of lammps module:
use deepmd-kit at: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310
source:
source branch:
source commit:
source commit at:
build with inc:
build with lib:
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2025-10-26 16:36:41.335696: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-26 16:36:41.336511: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371404: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371445: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371497: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371518: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371544: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:36:41.371559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14270 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
2025-10-26 16:36:41.585174: F tensorflow/core/framework/op.cc:213] Non-OK-status: RegisterAlreadyLocked(op_data_factory) status: ALREADY_EXISTS: Op with name Gelu
[mizubai-MS-7E28:44114] *** Process received signal ***
[mizubai-MS-7E28:44114] Signal: Aborted (6)
[mizubai-MS-7E28:44114] Signal code: (-6)
[mizubai-MS-7E28:44114] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x755d0bc42520]
[mizubai-MS-7E28:44114] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x755d0bc969fc]
[mizubai-MS-7E28:44114] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x755d0bc42476]
[mizubai-MS-7E28:44114] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x755d0bc287f3]
[mizubai-MS-7E28:44114] [ 4] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZN3tsl8internal15LogMessageFatalD2Ev+0x20)[0x755d0add1340]
[mizubai-MS-7E28:44114] [ 5] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZTv0_n24_N3tsl8internal15LogMessageFatalD1Ev+0x0)[0x755d0add1360]
[mizubai-MS-7E28:44114] [ 6] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZNK10tensorflow10OpRegistry16MustCallDeferredEv+0x1d3)[0x755d0a234283]
[mizubai-MS-7E28:44114] [ 7] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZNK10tensorflow10OpRegistry10LookUpSlowERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x42)[0x755d0a233d22]
[mizubai-MS-7E28:44114] [ 8] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZNK10tensorflow10OpRegistry6LookUpERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPPKNS_18OpRegistrationDataE+0x1e)[0x755d0a23394e]
[mizubai-MS-7E28:44114] [ 9] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZNK10tensorflow25FunctionLibraryDefinition6LookUpERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPPKNS_18OpRegistrationDataE+0x69)[0x755d0a04dc09]
[mizubai-MS-7E28:44114] [10] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZNK10tensorflow19OpRegistryInterface11LookUpOpDefERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPPKNS_5OpDefE+0x2a)[0x755d0a2332ba]
[mizubai-MS-7E28:44114] [11] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZN10tensorflow25AddDefaultAttrsToGraphDefEPNS_8GraphDefERKNS_19OpRegistryInterfaceEib+0x81)[0x755d0a05bb11]
[mizubai-MS-7E28:44114] [12] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZN10tensorflow25AddDefaultAttrsToGraphDefEPNS_8GraphDefERKNS_19OpRegistryInterfaceEi+0x11)[0x755d0a05ba81]
[mizubai-MS-7E28:44114] [13] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2(_ZN10tensorflow19GraphExecutionState16MakeForBaseGraphEONS_8GraphDefERKNS_26GraphExecutionStateOptionsEPSt10unique_ptrIS0_St14default_deleteIS0_EE+0x87)[0x755d0a76a597]
[mizubai-MS-7E28:44114] [14] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession12ExtendLockedEONS_8GraphDefE+0x1ca)[0x755d06f4491a]
[mizubai-MS-7E28:44114] [15] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession6CreateEONS_8GraphDefE+0xc5)[0x755d06f44715]
[mizubai-MS-7E28:44114] [16] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession6CreateERKNS_8GraphDefE+0x31)[0x755d06f44631]
[mizubai-MS-7E28:44114] [17] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_cc.so(_ZN6deepmd9DeepPotTF4initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKiS8_+0x385)[0x755d0b7348b5]
[mizubai-MS-7E28:44114] [18] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_cc.so(_ZN6deepmd9DeepPotTFC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKiS8_+0x13c)[0x755d0b73520c]
[mizubai-MS-7E28:44114] [19] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot4initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKiS8_+0x1de)[0x755d0b71a0ee]
[mizubai-MS-7E28:44114] [20] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPotC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKiS8_+0x4c)[0x755d0b71a33c]
[mizubai-MS-7E28:44114] [21] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_c.so(DP_NewDeepPotWithParam2+0xed)[0x755d0cb5dced]
[mizubai-MS-7E28:44114] [22] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0(_ZN6deepmd3hpp7DeepPot4initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKiS9_+0xe3)[0x755d0c9d17d3]
[mizubai-MS-7E28:44114] [23] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0(_ZN9LAMMPS_NS10PairDeepMD8settingsEiPPc+0x2302)[0x755d0c9cfa92]
[mizubai-MS-7E28:44114] [24] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0x722)[0x755d0c66d3e2]
[mizubai-MS-7E28:44114] [25] /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x16e)[0x755d0c66db6e]
[mizubai-MS-7E28:44114] [26] lmp(+0x1311)[0x60f19d4c9311]
[mizubai-MS-7E28:44114] [27] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x755d0bc29d90]
[mizubai-MS-7E28:44114] [28] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x755d0bc29e40]
[mizubai-MS-7E28:44114] [29] lmp(+0x1395)[0x60f19d4c9395]
[mizubai-MS-7E28:44114] *** End of error message ***
Aborted (core dumped)
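The fatal line in the log above is easy to miss among the repeated NUMA messages. The following is a minimal sketch, not part of deepmd-kit or TensorFlow, that keeps only the error (E) and fatal (F) lines. It assumes the `timestamp: LEVEL file:line] message` layout seen in this log; the names `TF_LOG_RE`, `filter_tf_log` and `SAMPLE_LOG` are hypothetical helpers introduced for illustration.

```python
import re

# Assumed layout, taken from the log lines in this report:
# "2025-10-26 16:36:41.585174: F tensorflow/core/framework/op.cc:213] message"
TF_LOG_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)


def filter_tf_log(text, levels="EF"):
    """Return (level, source, message) for lines whose severity is in `levels`."""
    hits = []
    for raw in text.splitlines():
        match = TF_LOG_RE.match(raw.strip())
        if match and match.group("level") in levels:
            hits.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return hits


# Three lines copied from the log above (one E, one I, one F).
SAMPLE_LOG = """\
2025-10-26 16:36:40.296306: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2025-10-26 16:36:41.371559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14270 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
2025-10-26 16:36:41.585174: F tensorflow/core/framework/op.cc:213] Non-OK-status: RegisterAlreadyLocked(op_data_factory) status: ALREADY_EXISTS: Op with name Gelu
"""

for level, source, message in filter_tf_log(SAMPLE_LOG):
    print(level, source, message)
```

Passing `levels="F"` narrows the result to the single line that triggers the abort. Lines that do not follow the assumed layout, such as the MPI signal trace, are skipped.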
Steps to Reproduce
Platform
- CPU: AMD Ryzen 7 7700 8-Core Processor
- GPU: AMD Radeon RX 7800 XT
- OS: Ubuntu 22.04.5 LTS x86_64
- ROCm 6.2.4
How I built deepmd-kit and lammps from source
- Created a conda environment with python 3.10 and installed tensorflow-rocm 2.16.1 (the highest version that rocm 6.2.x supports).
- Cloned the deepmd-kit repo, checked out tag/v3.1.1, and followed the instructions to build and install it into my conda environment (/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/bin/dp), just like the offline package.
$ DP_VARIANT=rocm ROCM_ROOT=/opt/rocm-6.2.4 DP_ENABLE_TENSORFLOW=1 pip install .
- Then built deepmd-kit's C++ interface and generated the USER-DEEPMD directory with make lammps.
- Copied the USER-DEEPMD directory to /home/mizu-bai/lammps-stable_2Aug2023_update1/src/USER-DEEPMD. The lammps CMakeLists.txt was patched
target_link_libraries(lmp PRIVATE lammps ncurses)
...
find_package(HIP REQUIRED)
include(/home/mizu-bai/deepmd-kit-3.1.1/source/lmp/builtin.cmake)
and configured.
$ export deepmd_root=$CONDA_PREFIX
$ cmake -D LAMMPS_INSTALL_RPATH=ON -D BUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_PREFIX_PATH=${deepmd_root} ../cmake -DCMAKE_PREFIX_PATH=/opt/rocm-6.2.4
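The configure command above passes -DCMAKE_PREFIX_PATH twice. To my understanding, CMake keeps only the last value when the same cache variable is defined more than once on the command line, so the first value (${deepmd_root}) may be dropped. Whether this relates to the crash is not established here. The sketch below is a hypothetical helper (`cmake_cache_definitions`, `repeated_definitions` and `COMMAND` are names made up for illustration) that lists variables defined more than once.

```python
import shlex
from collections import defaultdict


def cmake_cache_definitions(command):
    """Collect -D definitions from a cmake command line, in order of appearance."""
    tokens = shlex.split(command)  # does not expand ${...}; it is kept literally
    definitions = defaultdict(list)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        entry = None
        if token == "-D" and i + 1 < len(tokens):
            entry = tokens[i + 1]  # form: -D NAME=VALUE
            i += 1
        elif token.startswith("-D") and len(token) > 2:
            entry = token[2:]  # form: -DNAME=VALUE
        if entry and "=" in entry:
            name, value = entry.split("=", 1)
            name = name.split(":", 1)[0]  # drop an optional :TYPE suffix
            definitions[name].append(value)
        i += 1
    return dict(definitions)


def repeated_definitions(command):
    """Return only the variables that are defined more than once."""
    return {
        name: values
        for name, values in cmake_cache_definitions(command).items()
        if len(values) > 1
    }


# The configure command from this report, copied verbatim.
COMMAND = (
    "cmake -D LAMMPS_INSTALL_RPATH=ON -D BUILD_SHARED_LIBS=yes "
    "-D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_PREFIX_PATH=${deepmd_root} "
    "../cmake -DCMAKE_PREFIX_PATH=/opt/rocm-6.2.4"
)

print(repeated_definitions(COMMAND))
```

If both prefixes are meant to be searched, CMAKE_PREFIX_PATH accepts a semicolon-separated list, so the two values could be combined into one quoted definition. The value that was actually used can be checked in the CMakeCache.txt of the build directory.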
Lammps info
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~$ which lmp
/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/bin/lmp
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~$ lmp -h
2025-10-26 16:54:52.928060: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
Large-scale Atomic/Molecular Massively Parallel Simulator - 2 Aug 2023 - Update 1
Usage example: lmp -var t 300 -echo screen -in in.alloy
List of command line options supported by this LAMMPS executable:
-echo none/screen/log/both : echoing of input script (-e)
-help : print this help message (-h)
-in none/filename : read input from file or stdin (default) (-i)
-kokkos on/off ... : turn KOKKOS mode on or off (-k)
-log none/filename : where to send log output (-l)
-mdi '<mdi flags>' : pass flags to the MolSSI Driver Interface
-mpicolor color : which exe in a multi-exe mpirun cmd (-m)
-cite : select citation reminder style (-c)
-nocite : disable citation reminder (-nc)
-nonbuf : disable screen/logfile buffering (-nb)
-package style ... : invoke package command (-pk)
-partition size1 size2 ... : assign partition sizes (-p)
-plog basename : basename for partition logs (-pl)
-pscreen basename : basename for partition screens (-ps)
-restart2data rfile dfile ... : convert restart to data file (-r2data)
-restart2dump rfile dgroup dstyle dfile ...
: convert restart to dump file (-r2dump)
-reorder topology-specs : processor reordering (-r)
-screen none/filename : where to send screen output (-sc)
-skiprun : skip loops in run and minimize (-sr)
-suffix gpu/intel/opt/omp : style suffix to apply (-sf)
-var varname value : set index style variable (-v)
OS: Linux "Ubuntu 22.04.5 LTS" 6.8.0-85-generic x86_64
Compiler: GNU C++ 11.4.0 with OpenMP 4.5
C++ standard: C++11
MPI v3.1: Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
Accelerator configuration:
Active compile time flags:
-DLAMMPS_GZIP
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit
Available compression formats:
Extension: .gz Command: gzip
Extension: .bz2 Command: bzip2
Extension: .zst Command: zstd
Extension: .xz Command: xz
Extension: .lzma Command: xz
Installed packages:
KSPACE MOLECULE RIGID
List of individual style options included in this LAMMPS executable
* Atom styles:
angle atomic body bond charge
ellipsoid full hybrid line molecular
sphere template tri
* Integrate styles:
respa verlet
* Minimize styles:
cg fire/old fire hftn quickmin
sd
* Pair styles:
born born/coul/long born/coul/msm buck buck/coul/cut
buck/coul/long buck/coul/msm buck/long/coul/long coul/cut
coul/debye coul/dsf coul/long coul/msm coul/streitz
coul/wolf deepmd deepspin reax mesont/tpm
hbond/dreiding/lj hbond/dreiding/morse hybrid
hybrid/overlay hybrid/scaled lj/charmm/coul/charmm
lj/charmm/coul/charmm/implicit lj/charmm/coul/long
lj/charmm/coul/msm lj/charmmfsw/coul/charmmfsh
lj/charmmfsw/coul/long lj/cut lj/cut/coul/cut lj/cut/coul/long
lj/cut/coul/msm lj/cut/tip4p/cut lj/cut/tip4p/long
lj/expand lj/long/coul/long lj/long/tip4p/long
morse soft table tip4p/cut tip4p/long
yukawa zbl zero
* Bond styles:
fene fene/expand gromos harmonic hybrid
morse quartic table zero
* Angle styles:
charmm cosine cosine/squared harmonic hybrid
table zero
* Dihedral styles:
charmm charmmfsw harmonic hybrid multi/harmonic
opls table zero
* Improper styles:
cvff harmonic hybrid umbrella zero
* KSpace styles:
ewald ewald/dipole ewald/dipole/spin ewald/disp
ewald/disp/dipole msm msm/cg pppm
pppm/cg pppm/dipole pppm/dipole/spin pppm/disp
pppm/disp/tip4p pppm/dplr pppm/stagger pppm/tip4p
* Fix styles
adapt addforce ave/atom ave/chunk ave/correlate
ave/grid ave/histo ave/histo/weight ave/time
aveforce balance box/relax cmap deform
deposit ave/spatial ave/spatial/sphere lb/pc
lb/rigid/pc/sphere client/md dplr dt/reset
efield ehex enforce2d evaporate external
gravity halt heat indent langevin
lineforce momentum move nph nph/sphere
npt npt/sphere nve nve/limit nve/noforce
nve/sphere nvt nvt/sllod nvt/sphere pair
planeforce press/berendsen print property/atom rattle
recenter restrain rigid rigid/nph rigid/nph/small
rigid/npt rigid/npt/small rigid/nve rigid/nve/small rigid/nvt
rigid/nvt/small rigid/small setforce shake spring
spring/chunk spring/self store/force store/state temp/berendsen
temp/rescale thermal/conductivity tune/kspace vector
viscous wall/harmonic wall/lj1043 wall/lj126 wall/lj93
wall/morse wall/reflect wall/region wall/table
* Compute styles:
aggregate/atom angle angle/local angmom/chunk bond
bond/local centro/atom centroid/stress/atom chunk/atom
chunk/spread/atom cluster/atom cna/atom com
com/chunk coord/atom count/type deeptensor/atom mesont
dihedral dihedral/local dipole dipole/chunk displace/atom
erotate/rigid erotate/sphere erotate/sphere/atom fragment/atom
global/atom group/group gyration gyration/chunk heat/flux
improper improper/local inertia/chunk ke ke/atom
ke/rigid msd msd/chunk omega/chunk orientorder/atom
pair pair/local pe pe/atom pressure
property/atom property/chunk property/grid property/local rdf
reduce reduce/chunk reduce/region rigid/local slice
stress/atom temp temp/chunk temp/com temp/deform
temp/partial temp/profile temp/ramp temp/region temp/sphere
torque/chunk vacf vcm/chunk
* Region styles:
block cone cylinder ellipsoid intersect
plane prism sphere union
* Dump styles:
atom cfg custom grid grid/vtk
image local movie xyz
* Command styles
angle_write balance change_box create_atoms create_bonds
create_box delete_atoms delete_bonds box kim_init
kim_interactions kim_param kim_property kim_query
reset_ids reset_atom_ids reset_mol_ids message server
dihedral_write displace_atoms info minimize read_data
read_dump read_restart replicate rerun run
set velocity write_coeff write_data write_dump
write_restart
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~$ ldd $(which lmp)
linux-vdso.so.1 (0x00007ffe1137c000)
liblammps.so.0 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0 (0x00007dd174c00000)
libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007dd175564000)
libstdc++.so.6 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libstdc++.so.6 (0x00007dd174a4c000)
libgcc_s.so.1 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libgcc_s.so.1 (0x00007dd17554b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007dd174800000)
libfftw3.so.3 => /lib/x86_64-linux-gnu/libfftw3.so.3 (0x00007dd174400000)
libfftw3_omp.so.3 => /lib/x86_64-linux-gnu/libfftw3_omp.so.3 (0x00007dd174a43000)
libdeepmd_c.so => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_c.so (0x00007dd1747d2000)
libgomp.so.1 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libgomp.so.1 (0x00007dd17479a000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007dd1746b3000)
libopen-rte.so.40 => /lib/x86_64-linux-gnu/libopen-rte.so.40 (0x00007dd174343000)
libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40 (0x00007dd174290000)
libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007dd174657000)
/lib64/ld-linux-x86-64.so.2 (0x00007dd1756b8000)
libdeepmd_cc.so => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/libdeepmd_cc.so (0x00007dd1741d7000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007dd174a3e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007dd174a39000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007dd17463b000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x00007dd1741a2000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007dd174a34000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007dd174178000)
libtensorflow_framework.so.2 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2 (0x00007dd171600000)
libtensorflow_cc.so.2 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2 (0x00007dd110200000)
libamdhip64.so.6 => /opt/rocm-6.2.4/lib/libamdhip64.so.6 (0x00007dd10e800000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007dd174a2f000)
libhsa-runtime64.so.1 => /opt/rocm-6.2.4/lib/libhsa-runtime64.so.1 (0x00007dd10e400000)
librccl.so.1 => /opt/rocm-6.2.4/lib/librccl.so.1 (0x00007dd0d4600000)
librocprofiler-register.so.0 => /opt/rocm-6.2.4/lib/librocprofiler-register.so.0 (0x00007dd1740f6000)
libamd_comgr.so.2 => /opt/rocm-6.2.4/lib/libamd_comgr.so.2 (0x00007dd0cb800000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007dd17462e000)
libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007dd1740d8000)
libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007dd1740be000)
libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007dd17461e000)
librocm_smi64.so.7 => /opt/rocm-6.2.4/lib/librocm_smi64.so.7 (0x00007dd1100cc000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007dd17408c000)
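The ldd output above resolves libraries from several places: the conda environment, the system directories, and the ROCm installation. Grouping them by origin makes that mix easier to review. This is only a sketch under the assumption that every resolved line has the `name => /path (0xaddr)` form shown above; `parse_ldd`, `group_by_origin`, `SAMPLE_LDD` and `PREFIXES` are hypothetical names, and the grouping by itself does not identify the cause of the duplicate Gelu registration.

```python
import re
from collections import defaultdict

# Matches "libname.so => /path/to/libname.so (0x...)" lines printed by ldd.
LDD_RE = re.compile(r"^\s*(?P<name>\S+) => (?P<path>/\S+) \(0x[0-9a-f]+\)\s*$")


def parse_ldd(text):
    """Map each library name to the path the dynamic loader resolved it to."""
    libs = {}
    for line in text.splitlines():
        match = LDD_RE.match(line)
        if match:
            libs[match.group("name")] = match.group("path")
    return libs


def group_by_origin(libs, prefixes):
    """Group library names under the first label whose path prefix matches.

    `prefixes` maps a label to a path prefix; unmatched libraries go to "other".
    """
    groups = defaultdict(list)
    for name, path in libs.items():
        label = next(
            (lab for lab, prefix in prefixes.items() if path.startswith(prefix)),
            "other",
        )
        groups[label].append(name)
    return dict(groups)


# A few lines copied from the ldd output above.
SAMPLE_LDD = """\
linux-vdso.so.1 (0x00007ffe1137c000)
liblammps.so.0 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/liblammps.so.0 (0x00007dd174c00000)
libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007dd175564000)
libtensorflow_cc.so.2 => /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2 (0x00007dd110200000)
libamdhip64.so.6 => /opt/rocm-6.2.4/lib/libamdhip64.so.6 (0x00007dd10e800000)
libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007dd1740be000)
"""

PREFIXES = {
    "conda": "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310",
    "rocm": "/opt/rocm-6.2.4",
    "system": "/lib/x86_64-linux-gnu",
}

for label, names in group_by_origin(parse_ldd(SAMPLE_LDD), PREFIXES).items():
    print(label, names)
```

To run it on the full listing, the text of `ldd $(which lmp)` can be passed to `parse_ldd` in place of `SAMPLE_LDD`.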
Further Information, Files, and Links
dp train works fine; I tried the hands-on CH4 case.
NOTE: If ROCM_PATH is not set, an error like the following may occur:
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ dp train input.json
2025-10-26 16:57:33.227346: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2025-10-26 16:57:33.254646: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-26 16:57:33.472997: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: DNN
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Switch to serial execution due to lack of horovod module.
[2025-10-26 16:57:35,750] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2025-10-26 16:57:36.043366: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.043422: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.067831: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.067877: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.067908: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.067934: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.067943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2322] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:15:00.0) with core count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
[2025-10-26 16:57:36,068] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-10-26 16:57:36.173595: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173651: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173710: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173735: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173758: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173783: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2322] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:15:00.0) with core count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2025-10-26 16:57:36.173830: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173857: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173885: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:57:36.173896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15164 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
2025-10-26 16:57:36.331001: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2025-10-26 16:57:36.382561: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:243] bitcode module is required by this HLO module but was not found at ./opencl.bc
error: Failure when generating HSACO
2025-10-26 16:57:36.382714: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:207] INTERNAL: Generating device code failed.
2025-10-26 16:57:36.382955: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
2025-10-26 16:57:36.382969: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
2025-10-26 16:57:36.382975: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
[[Max/_47]]
2025-10-26 16:57:36.382981: I tensorflow/core/framework/local_rendezvous.cc:423] Local rendezvous recv item cancelled. Key hash: 6382452783897608305
2025-10-26 16:57:36.382989: I tensorflow/core/framework/local_rendezvous.cc:423] Local rendezvous recv item cancelled. Key hash: 17067356131073388760
2025-10-26 16:57:36.382998: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
2025-10-26 16:57:36.383003: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
return fn(*args)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
[[Max/_47]]
(1) UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/bin/dp", line 7, in <module>
sys.exit(main())
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/main.py", line 1020, in main
deepmd_main(args)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/entrypoints/main.py", line 72, in main
train_dp(**dict_args)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/entrypoints/train.py", line 170, in train
jdata = update_sel(jdata)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/entrypoints/train.py", line 303, in update_sel
jdata_cpy["model"], min_nbor_dist = Model.update_sel(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/model/model.py", line 547, in update_sel
return cls.update_sel(train_data, type_map, local_jdata)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/model/model.py", line 975, in update_sel
local_jdata_cpy["descriptor"], min_nbor_dist = Descriptor.update_sel(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/descriptor/descriptor.py", line 492, in update_sel
return cls.update_sel(train_data, type_map, local_jdata)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/descriptor/se.py", line 180, in update_sel
min_nbor_dist, local_jdata_cpy["sel"] = UpdateSel().update_one_sel(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/update_sel.py", line 34, in update_one_sel
min_nbor_dist, tmp_sel = self.get_nbor_stat(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/update_sel.py", line 123, in get_nbor_stat
min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/neighbor_stat.py", line 66, in get_stat
for mn, dt, jj in self.iterator(data):
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/utils/neighbor_stat.py", line 240, in iterator
minrr2, max_nnei = self.auto_batch_size.execute_all(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 208, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 112, in execute
raise e
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 109, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 179, in execute_with_batch_size
return (end_index - start_index), callable(
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/utils/neighbor_stat.py", line 277, in _execute
minrr2, max_nnei = run_sess(self.sub_sess, self.op, feed_dict=feed_dict)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd/tf/utils/sess.py", line 31, in run_sess
return sess.run(*args, **kwargs)
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 972, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1215, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1395, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1421, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
[[Max/_47]]
(1) UNKNOWN: JIT compilation failed.
[[{{node cond/norm_1/ArithmeticOptimizer/ReplaceMulWithSquare_mul}}]]
0 successful operations.
0 derived errors ignored.
Solution: set ROCM_PATH to the ROCm installation directory before running dp train:
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ export ROCM_PATH=/opt/rocm-6.2.4
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ dp train input.json
2025-10-26 16:59:41.364754: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2025-10-26 16:59:41.391973: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-26 16:59:41.611485: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: DNN
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Switch to serial execution due to lack of horovod module.
[2025-10-26 16:59:43,893] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2025-10-26 16:59:44.188408: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.188463: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.213118: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.213169: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.213197: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.213221: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.213230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2322] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:15:00.0) with core count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
[2025-10-26 16:59:44,213] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-10-26 16:59:44.329889: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.329945: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330001: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330025: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330049: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330072: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2322] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:15:00.0) with core count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2025-10-26 16:59:44.330125: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330153: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330181: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:44.330193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15164 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
2025-10-26 16:59:44.488002: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
[2025-10-26 16:59:45,610] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 1.042950
[2025-10-26 16:59:45,610] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [4 1] (cutoff radius: 6.000000)
[2025-10-26 16:59:45,628] DEEPMD INFO _____ _____ __ __ _____ _ _ _
[2025-10-26 16:59:45,628] DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[2025-10-26 16:59:45,628] DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[2025-10-26 16:59:45,628] DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[2025-10-26 16:59:45,628] DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[2025-10-26 16:59:45,628] DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[2025-10-26 16:59:45,628] DEEPMD INFO Please read and cite:
[2025-10-26 16:59:45,628] DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2025-10-26 16:59:45,628] DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2025-10-26 16:59:45,628] DEEPMD INFO Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2025-10-26 16:59:45,628] DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
[2025-10-26 16:59:45,628] DEEPMD INFO -----------------------------------------------------------------------------------------------------------------------------
[2025-10-26 16:59:45,628] DEEPMD INFO installed to: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd
[2025-10-26 16:59:45,628] DEEPMD INFO source: v3.1.1
[2025-10-26 16:59:45,628] DEEPMD INFO source branch: HEAD
[2025-10-26 16:59:45,628] DEEPMD INFO source commit: bfa62458
[2025-10-26 16:59:45,628] DEEPMD INFO source commit at: 2025-10-01 01:40:47 +0800
[2025-10-26 16:59:45,628] DEEPMD INFO use float prec: double
[2025-10-26 16:59:45,628] DEEPMD INFO build variant: rocm
[2025-10-26 16:59:45,628] DEEPMD INFO Backend: TensorFlow
[2025-10-26 16:59:45,628] DEEPMD INFO TF ver: v2.16.1-4412-g292e0c2d523
[2025-10-26 16:59:45,628] DEEPMD INFO build with TF ver: 2.16.1
[2025-10-26 16:59:45,628] DEEPMD INFO build with TF inc: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include/
[2025-10-26 16:59:45,628] DEEPMD INFO /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include/
[2025-10-26 16:59:45,628] DEEPMD INFO build with TF lib:
[2025-10-26 16:59:45,628] DEEPMD INFO running on: mizubai-MS-7E28
[2025-10-26 16:59:45,628] DEEPMD INFO computing device: gpu:0
[2025-10-26 16:59:45,628] DEEPMD INFO HIP_VISIBLE_DEVICES: unset
[2025-10-26 16:59:45,628] DEEPMD INFO Count of visible GPUs: 1
[2025-10-26 16:59:45,628] DEEPMD INFO num_intra_threads: 0
[2025-10-26 16:59:45,628] DEEPMD INFO num_inter_threads: 0
[2025-10-26 16:59:45,628] DEEPMD INFO -----------------------------------------------------------------------------------------------------------------------------
2025-10-26 16:59:45.629654: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629705: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629762: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629787: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629811: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629835: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2322] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:15:00.0) with core count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2025-10-26 16:59:45.629876: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629903: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:45.629913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15164 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
[2025-10-26 16:59:45,630] DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[2025-10-26 16:59:45,630] DEEPMD INFO found 1 system(s):
[2025-10-26 16:59:45,630] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-10-26 16:59:45,630] DEEPMD INFO ../00.data/training_data 5 7 22 1.000e+00 T
[2025-10-26 16:59:45,630] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-10-26 16:59:45,631] DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
[2025-10-26 16:59:45,631] DEEPMD INFO found 1 system(s):
[2025-10-26 16:59:45,631] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-10-26 16:59:45,631] DEEPMD INFO ../00.data/validation_data 5 7 5 1.000e+00 T
[2025-10-26 16:59:45,631] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-10-26 16:59:45,631] DEEPMD INFO training without frame parameter
[2025-10-26 16:59:45,631] DEEPMD INFO data stating... (this step may take long time)
[2025-10-26 16:59:45,680] DEEPMD INFO built lr
[2025-10-26 16:59:46,102] DEEPMD INFO built network
[2025-10-26 16:59:46,584] DEEPMD INFO built training
[2025-10-26 16:59:46,584] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2025-10-26 16:59:46.584855: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:46.584939: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:46.584968: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:46.585010: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:46.585036: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:46.585047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15164 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
[2025-10-26 16:59:46,607] DEEPMD INFO initialize model from scratch
[2025-10-26 16:59:47,103] DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[2025-10-26 16:59:47,596] DEEPMD INFO batch 0: trn: rmse = 1.52e+01, rmse_e = 7.05e-01, rmse_f = 4.82e-01, lr = 1.00e-03
[2025-10-26 16:59:47,596] DEEPMD INFO batch 0: val: rmse = 1.19e+01, rmse_e = 7.04e-01, rmse_f = 3.77e-01
2025-10-26 16:59:47.940316: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:47.940424: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:47.940458: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:47.940506: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:47.940556: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:926] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-10-26 16:59:47.940592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15164 MB memory: -> device: 0, name: AMD Radeon RX 7800 XT, pci bus id: 0000:03:00.0
[2025-10-26 17:00:00,062] DEEPMD INFO batch 1000: trn: rmse = 3.15e+00, rmse_e = 4.25e-01, rmse_f = 9.97e-02, lr = 1.00e-03
[2025-10-26 17:00:00,062] DEEPMD INFO batch 1000: val: rmse = 4.91e+00, rmse_e = 4.25e-01, rmse_f = 1.55e-01
[2025-10-26 17:00:00,062] DEEPMD INFO batch 1000: total wall time = 12.96 s
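Exporting ROCM_PATH is what made the JIT compilation error go away in the session above. A minimal sketch of a guard that catches a mistyped path before training starts could look like this. `set_rocm_path` is a hypothetical helper (not part of DeePMD-kit, TensorFlow, or ROCm), and the path is the one from my machine, so adjust it to the local install:

```shell
# set_rocm_path is a hypothetical helper, not part of DeePMD-kit or ROCm.
# It exports ROCM_PATH only if the given directory exists.
set_rocm_path() {
    if [ -d "$1" ]; then
        export ROCM_PATH="$1"
    else
        echo "ROCm directory not found: $1" >&2
        return 1
    fi
}

# Path taken from the session above; adjust to the local install.
if set_rocm_path /opt/rocm-6.2.4; then
    echo "ROCM_PATH=$ROCM_PATH"
    # dp train input.json
fi
```

The training command is left commented out so the snippet only sets up the environment. Uncomment it once the path check passes.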
To quiet the TensorFlow INFO logs, set TF_CPP_MIN_LOG_LEVEL=2:
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ export ROCM_ROOT=/opt/rocm-6.2.4
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ export TF_CPP_MIN_LOG_LEVEL=2
(deepmd-kit-3.1.1-py310) mizu-bai@mizubai-MS-7E28:~/CH4/01.train$ dp train input.json
2025-10-26 17:00:50.514247: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2025-10-26 17:00:50.759782: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: DNN
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Switch to serial execution due to lack of horovod module.
[2025-10-26 17:00:53,061] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-10-26 17:00:53,384] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2025-10-26 17:00:54,753] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 1.042950
[2025-10-26 17:00:54,753] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [4 1] (cutoff radius: 6.000000)
[2025-10-26 17:00:54,770] DEEPMD INFO _____ _____ __ __ _____ _ _ _
[2025-10-26 17:00:54,770] DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[2025-10-26 17:00:54,770] DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[2025-10-26 17:00:54,770] DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[2025-10-26 17:00:54,770] DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[2025-10-26 17:00:54,770] DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[2025-10-26 17:00:54,770] DEEPMD INFO Please read and cite:
[2025-10-26 17:00:54,770] DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2025-10-26 17:00:54,770] DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2025-10-26 17:00:54,770] DEEPMD INFO Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2025-10-26 17:00:54,770] DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
[2025-10-26 17:00:54,770] DEEPMD INFO -----------------------------------------------------------------------------------------------------------------------------
[2025-10-26 17:00:54,770] DEEPMD INFO installed to: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/deepmd
[2025-10-26 17:00:54,770] DEEPMD INFO source: v3.1.1
[2025-10-26 17:00:54,770] DEEPMD INFO source branch: HEAD
[2025-10-26 17:00:54,770] DEEPMD INFO source commit: bfa62458
[2025-10-26 17:00:54,771] DEEPMD INFO source commit at: 2025-10-01 01:40:47 +0800
[2025-10-26 17:00:54,771] DEEPMD INFO use float prec: double
[2025-10-26 17:00:54,771] DEEPMD INFO build variant: rocm
[2025-10-26 17:00:54,771] DEEPMD INFO Backend: TensorFlow
[2025-10-26 17:00:54,771] DEEPMD INFO TF ver: v2.16.1-4412-g292e0c2d523
[2025-10-26 17:00:54,771] DEEPMD INFO build with TF ver: 2.16.1
[2025-10-26 17:00:54,771] DEEPMD INFO build with TF inc: /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include/
[2025-10-26 17:00:54,771] DEEPMD INFO /home/mizu-bai/miniforge3/envs/deepmd-kit-3.1.1-py310/lib/python3.10/site-packages/tensorflow/include/
[2025-10-26 17:00:54,771] DEEPMD INFO build with TF lib:
[2025-10-26 17:00:54,771] DEEPMD INFO running on: mizubai-MS-7E28
[2025-10-26 17:00:54,771] DEEPMD INFO computing device: gpu:0
[2025-10-26 17:00:54,771] DEEPMD INFO HIP_VISIBLE_DEVICES: unset
[2025-10-26 17:00:54,771] DEEPMD INFO Count of visible GPUs: 1
[2025-10-26 17:00:54,771] DEEPMD INFO num_intra_threads: 0
[2025-10-26 17:00:54,771] DEEPMD INFO num_inter_threads: 0
[2025-10-26 17:00:54,771] DEEPMD INFO -----------------------------------------------------------------------------------------------------------------------------
[2025-10-26 17:00:54,773] DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[2025-10-26 17:00:54,773] DEEPMD INFO found 1 system(s):
[2025-10-26 17:00:54,773] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-10-26 17:00:54,773] DEEPMD INFO ../00.data/training_data 5 7 22 1.000e+00 T
[2025-10-26 17:00:54,773] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-10-26 17:00:54,774] DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
[2025-10-26 17:00:54,774] DEEPMD INFO found 1 system(s):
[2025-10-26 17:00:54,774] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-10-26 17:00:54,774] DEEPMD INFO ../00.data/validation_data 5 7 5 1.000e+00 T
[2025-10-26 17:00:54,774] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-10-26 17:00:54,774] DEEPMD INFO training without frame parameter
[2025-10-26 17:00:54,774] DEEPMD INFO data stating... (this step may take long time)
[2025-10-26 17:00:54,823] DEEPMD INFO built lr
[2025-10-26 17:00:55,236] DEEPMD INFO built network
[2025-10-26 17:00:55,712] DEEPMD INFO built training
[2025-10-26 17:00:55,712] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2025-10-26 17:00:55,735] DEEPMD INFO initialize model from scratch
[2025-10-26 17:00:56,214] DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[2025-10-26 17:00:56,726] DEEPMD INFO batch 0: trn: rmse = 1.52e+01, rmse_e = 7.05e-01, rmse_f = 4.82e-01, lr = 1.00e-03
[2025-10-26 17:00:56,726] DEEPMD INFO batch 0: val: rmse = 1.19e+01, rmse_e = 7.04e-01, rmse_f = 3.77e-01
[2025-10-26 17:01:09,239] DEEPMD INFO batch 1000: trn: rmse = 3.15e+00, rmse_e = 4.25e-01, rmse_f = 9.97e-02, lr = 1.00e-03
[2025-10-26 17:01:09,239] DEEPMD INFO batch 1000: val: rmse = 4.91e+00, rmse_e = 4.25e-01, rmse_f = 1.55e-01
[2025-10-26 17:01:09,239] DEEPMD INFO batch 1000: total wall time = 13.03 s
[2025-10-26 17:01:21,349] DEEPMD INFO batch 2000: trn: rmse = 3.42e+00, rmse_e = 6.53e-02, rmse_f = 1.08e-01, lr = 1.00e-03
[2025-10-26 17:01:21,349] DEEPMD INFO batch 2000: val: rmse = 4.34e+00, rmse_e = 6.53e-02, rmse_f = 1.37e-01
[2025-10-26 17:01:21,349] DEEPMD INFO batch 2000: total wall time = 12.11 s