Error running the multi-GPU implementation #87
Replies: 4 comments
-
Hello Pablo, I think this is probably a gcc or OpenMPI version issue. I have just tested the r0.12 branch using gcc 5.4 with OpenMPI 4.0.1, and it all works well. Please try compiling deepmd-kit and LAMMPS with gcc 5.4 or a lower version of gcc instead.
-
Thanks @denghuilu. I will try that.
-
@denghuilu I just tried with gcc 4.8.5 and OpenMPI 4.0.1, and I still have the same problem. LAMMPS works properly if I remove DeePMD, dp_train seems to work fine, and LAMMPS with DeePMD works with 1 CPU and 1 GPU. However, LAMMPS with DeePMD gives the segmentation fault from my original post when I use 2 GPUs and 2 CPUs. Any other ideas?
-
@PabloPiaggi I have not encountered this problem before, and I cannot reproduce it on my workstation. Could you try two CPUs with one GPU to see what happens? It would also help if you could provide the full LAMMPS output log. By the way, which version of LAMMPS are you using? Could you also try compiling and running LAMMPS with Intel MPI (impi)?
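The two-CPUs-with-one-GPU test suggested above could be set up roughly as follows. This is only a sketch under assumptions: the executable name `lmp`, the input file `in.lammps`, and the helper `build_test_run` are hypothetical placeholders, and it assumes a single-node run where `mpirun` passes the environment on to its ranks. `CUDA_VISIBLE_DEVICES` is the standard CUDA variable for restricting which GPUs a process can see.

```python
import os
import shlex


def build_test_run(n_ranks, gpu_ids, lammps_bin="lmp", input_file="in.lammps"):
    """Return (argv, env) for an MPI run restricted to the given GPUs.

    lammps_bin and input_file are hypothetical defaults; replace them
    with the names used in your own build and job.
    """
    env = dict(os.environ)
    # Restrict the GPUs that CUDA programs launched with this env can see.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = ["mpirun", "-np", str(n_ranks), lammps_bin, "-in", input_file]
    return argv, env


if __name__ == "__main__":
    # Two MPI ranks sharing GPU 0: the configuration suggested for isolating the fault.
    argv, env = build_test_run(n_ranks=2, gpu_ids=[0])
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
    # To actually launch it: subprocess.run(argv, env=env, check=True)
```

If the run with `gpu_ids=[0]` succeeds while `gpu_ids=[0, 1]` crashes, that would point toward the per-rank GPU assignment rather than the MPI layer alone.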
-
I am testing the new implementation in branch r0.12 that supports multiple GPUs. I am compiling deepmd-kit, the TensorFlow Python package, and the TensorFlow library using the following software/libraries:
All the compilations are successful, and I can run jobs with 1 GPU and 1 CPU. However, using 2 GPUs and 2 CPUs fails with a segmentation fault. This is the LAMMPS output:
Any suggestions?
Thanks,
Pablo