Some problems encountered in using GPU to accelerate lammps #3033

SEU-NiuWenLong · 2023-12-02T13:33:05Z

I have two GPU cards. When I run LAMMPS with GPU acceleration, it always aborts after running for a while and reports the following error:
Possible remote error message: ==> /home/gcniu/workspace/deepmd/23-32/run/temp/81c427e9fd55ff100029be97075854c91642ee29/task.002.000055/model_devi.log <==
ibdeepmd_1697184996481/work/source/lib/src/gpu/prod_env_mat.cu: 625, in file /home/conda/feedstock_root/build_artifacts/libdeepmd_1697184996481/work/source/op/custom_op.cc:18
[[{{node ProdEnvMatA}}]]
[[o_energy/_31]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Runtime library throws an error: an illegal memory access was encountered, in file /home/conda/feedstock_root/build_artifacts/libdeepmd_1697184996481/work/source/lib/src/gpu/prod_env_mat.cu: 625, in file /home/conda/feedstock_root/build_artifacts/libdeepmd_1697184996481/work/source/op/custom_op.cc:18
[[{{node ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored. (/home/conda/feedstock_root/build_artifacts/libdeepmd_1697184996481/work/source/lmp/pair_deepmd.cpp:634)
Last command: run ${NSTEPS} upto
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor
This is my machine.json file:
"model_devi": [
  {
    "command": "lmp",
    "machine": {
      "context_type": "local",
      "batch_type": "Slurm",
      "local_root": "./",
      "remote_root": "/home/gcniu/workspace/deepmd/23-32/run/temp"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 16,
      "gpu_per_node": 2,
      "queue_name": "GPU",
      "strategy": {"if_cuda_multi_devices": true},
      "custom_flags": [
        "#SBATCH -J gcniu",
        "#SBATCH -n 16",
        "#SBATCH -o %j.log",
        "#SBATCH -e %j.log"
      ],
      "group_size": 1000,
      "_source_list": ["/home/gcniu/workspace/deepmd/23-32/run/envs.sh"]
    }
  }
]
Is there something wrong with my parameter file configuration? Or is it something else?

njzjz (Maintainer) · 2023-12-04T21:34:21Z

@Yi-FanLi Is this the error you got?

1 reply

njzjz (Maintainer) · Dec 4, 2023

Yifan said yes. Let's track this in #3034.
