Error running the multi-GPU implementation #87

PabloPiaggi · 2019-10-04T13:01:50Z

PabloPiaggi
Oct 4, 2019

I am testing the new multi-GPU implementation in branch r0.12. I compiled deepmd-kit, the TensorFlow Python package, and the TensorFlow library with the following software and libraries:

Anaconda with Python 3.6.8
Bazel 0.24.1
gcc 7.3.1
cudatoolkit 10.0
openmpi 3.1.4
tensorflow 1.14
deepmd-kit branch r0.12

All compilations succeed, and jobs with 1 GPU and 1 CPU run fine. However, running with 2 GPUs and 2 CPUs fails with a segmentation fault. This is the LAMMPS output:

2019-10-04 08:52:09.730185: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-04 08:52:09.730185: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
[traverse-k05g2:20819] *** Process received signal ***
[traverse-k05g2:20819] Signal: Segmentation fault (11)
[traverse-k05g2:20819] Signal code: Address not mapped (1)
[traverse-k05g2:20819] Failing at address: (nil)
[traverse-k05g2:20818] *** Process received signal ***
[traverse-k05g2:20818] Signal: Segmentation fault (11)
[traverse-k05g2:20818] Signal code: Address not mapped (1)
[traverse-k05g2:20818] Failing at address: (nil)
[traverse-k05g2:20818] [traverse-k05g2:20819] [ 0] [0x2000000504d8]
[ 0] [0x2000000504d8]
[traverse-k05g2:20818] [ 1] [0x7ffff3ac1390]
[traverse-k05g2:20818] [ 2] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_filter_cpus+0x484)[0x200020acca44]
[traverse-k05g2:20818] [ 3] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_get_topology+0x4f0)[0x200020ad1830]
[traverse-k05g2:20818] [ 4] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-rte.so.40(orte_ess_base_proc_binding+0x9b4)[0x2000209a0df4]
[traverse-k05g2:20818] [ 5] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/openmpi/mca_ess_pmi.so(+0x3e74)[0x200020ef3e74]
[traverse-k05g2:20818] [traverse-k05g2:20819] [ 1] [0x7fffffdc6470]
[traverse-k05g2:20819] [ 2] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_filter_cpus+0x484)[0x200020acca44]
[traverse-k05g2:20819] [ 3] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_get_topology+0x4f0)[0x200020ad1830]
[traverse-k05g2:20819] [ 4] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-rte.so.40(orte_ess_base_proc_binding+0x9b4)[0x2000209a0df4]
[traverse-k05g2:20819] [ 5] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/openmpi/mca_ess_pmi.so(+0x3e74)[0x200020ef3e74]
[traverse-k05g2:20819] [ 6] [ 6] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-rte.so.40(orte_init+0x3ac)[0x200020950cdc]
[traverse-k05g2:20819] [ 7] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-rte.so.40(orte_init+0x3ac)[0x200020950cdc]
[traverse-k05g2:20818] [ 7] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libmpi.so.40(ompi_mpi_init+0x40c)[0x200018e210ec]
[traverse-k05g2:20819] [ 8] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libmpi.so.40(MPI_Init+0xa4)[0x200018e52f44]
[traverse-k05g2:20819] [ 9] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libmpi.so.40(ompi_mpi_init+0x40c)[0x200018e210ec]
[traverse-k05g2:20818] [ 8] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libmpi.so.40(MPI_Init+0xa4)[0x200018e52f44]
[traverse-k05g2:20818] [ 9] /home/ppiaggi/Programs/Lammps/lammps-deepmd/src/lmp_mpi(main+0x30)[0x101e9550]
[traverse-k05g2:20819] [10] /home/ppiaggi/Programs/Lammps/lammps-deepmd/src/lmp_mpi(main+0x30)[0x101e9550]
[traverse-k05g2:20818] [10] /usr/lib64/libc.so.6(+0x25200)[0x200019115200]
[traverse-k05g2:20819] [11] /usr/lib64/libc.so.6(+0x25200)[0x200019115200]
[traverse-k05g2:20818] [11] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000191153f4]
[traverse-k05g2:20819] *** End of error message ***
/usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000191153f4]
[traverse-k05g2:20818] *** End of error message ***
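
Because both ranks write to the same stream, the two backtraces above are interleaved. As a reading aid, here is a minimal sketch that groups the named frames by PID. It is not part of deepmd-kit, LAMMPS, or Open MPI; the function name `frames_by_pid` is hypothetical, and attributing a frame to the last `[host:pid]` tag on its line is an assumption that can mislabel frames on interleaved lines.

```python
import re
from collections import defaultdict

# Matches the "[host:pid]" prefix seen in the log; frame indices such as
# "[ 0]" and raw addresses such as "[0x2000000504d8]" contain no colon.
TAG = re.compile(r"\[[^\[\]\s:]+:(\d+)\]")
# Matches a symbolized frame such as "libmpi.so.40(MPI_Init+0xa4)".
SYMBOL = re.compile(r"\((\w+)\+0x[0-9a-f]+\)")


def frames_by_pid(log_text):
    """Group symbol names by PID in order of appearance (heuristic)."""
    frames = defaultdict(list)
    for line in log_text.splitlines():
        tags = TAG.findall(line)
        symbol = SYMBOL.search(line)
        if not tags or symbol is None:
            continue  # skip untagged lines and frames without a symbol name
        # Assumption: on an interleaved line, the last tag owns the frame.
        frames[tags[-1]].append(symbol.group(1))
    return dict(frames)


# A few lines copied verbatim from the output above.
sample = """\
[traverse-k05g2:20818] [ 1] [0x7ffff3ac1390]
[traverse-k05g2:20818] [ 2] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_filter_cpus+0x484)[0x200020acca44]
[traverse-k05g2:20818] [ 3] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_get_topology+0x4f0)[0x200020ad1830]
[traverse-k05g2:20819] [ 2] /usr/local/openmpi/3.1.4/gcc/ppc64le/lib64/libopen-pal.so.40(opal_hwloc_base_filter_cpus+0x484)[0x200020acca44]
"""

result = frames_by_pid(sample)
for pid, names in result.items():
    # Innermost frame first, as in the original backtrace.
    print(pid, " <- ".join(names))
```

In the trace above, every named frame between `main` and the fault belongs to Open MPI (`MPI_Init` down to `opal_hwloc_base_filter_cpus`), so the crash appears to happen during MPI initialization.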

Any suggestions?

Thanks,

Pablo

denghuilu · 2019-10-04T13:40:06Z

denghuilu
Oct 4, 2019
Collaborator

Hello Pablo, I think this is probably a gcc or openmpi version issue. I have just tested the r0.12 branch with gcc 5.4 and openmpi 4.0.1, and it all works well. Please try compiling deepmd-kit and LAMMPS with gcc 5.4 or a lower version of gcc.


PabloPiaggi · 2019-10-04T13:42:29Z

PabloPiaggi
Oct 4, 2019
Author

Thanks @denghuilu. I will try that.


PabloPiaggi · 2019-10-04T20:22:39Z

PabloPiaggi
Oct 4, 2019
Author

@denghuilu I just tried with gcc 4.8.5 and openmpi 4.0.1, and I still have the same problem.

LAMMPS works properly if I remove DeePMD, dp_train seems to work fine, and LAMMPS with DeePMD works with 1 CPU and 1 GPU. However, LAMMPS with DeePMD gives the segmentation fault I mentioned above with 2 GPUs and 2 CPUs.

Any other ideas?


denghuilu · 2019-10-05T02:39:57Z

denghuilu
Oct 5, 2019
Collaborator

@PabloPiaggi I have not encountered this problem before, and I cannot reproduce it on my workstation. Could you try two CPUs with one GPU to see what happens? It would also help if you could provide the full LAMMPS output log. By the way, which version of LAMMPS are you using? Could you also try compiling and running LAMMPS with Intel MPI?

