Closed
Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.x head
$ git log --oneline -10
75795c04eb (HEAD -> v5.0.x, origin/v5.0.x) Merge pull request #12821 from Sergei-Lebedev/topic/coll_ucc_fix_buf_size_overflow_v5
a2868acd84 coll/ucc: fix int overflow in coll init
6f08eaf910 Merge pull request #12781 from janjust/v5.0.x
6f91498f59 Merge pull request #12809 from edgargabriel/pr/vulcan-aggr-list-leak-v5.0.x
ff740b4256 fcoll/vulcan: fix memory leak
d380ab6971 Merge pull request #12798 from wenduwan/fix_ipv6
ce3b892360 3rd-party/openpmix: include ipv6 fix
3968cab0fe Merge pull request #12800 from wenduwan/test_mpi4py
b4c98c9487 .github/workflow: set up runtime params right before mpi4py test
3bec944cf0 Merge pull request #12789 from jsquyres/pr/v5.0.x/gcc-14-complier-warning-fixes
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Source build
./configure --with-sge --without-verbs --disable-man-pages --enable-ipv6 LDFLAGS=-Wl,--as-needed --enable-prte-prefix-by-default --enable-mca-dso=all --with-libevent=external --with-hwloc=external --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-debug
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
e62fa4252f0cadda29c4103e01b0e277e8180d3e 3rd-party/openpmix (v5.0.3-17-ge62fa425)
b68a0acb32cfc0d3c19249e5514820555bcf438b 3rd-party/prrte (v3.0.6)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2
- Computer hardware: AWS EC2 p4d.24xlarge
$ nvidia-smi
Tue Sep 24 17:56:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:10:1C.0 Off | 0 |
| N/A 45C P0 60W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:10:1D.0 Off | 0 |
| N/A 41C P0 57W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 44C P0 59W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:20:1D.0 Off | 0 |
| N/A 39C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB On | 00000000:90:1C.0 Off | 0 |
| N/A 42C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB On | 00000000:90:1D.0 Off | 0 |
| N/A 41C P0 58W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB On | 00000000:A0:1C.0 Off | 0 |
| N/A 46C P0 62W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 40C P0 63W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- Network type: BTL/SM
Details of the problem
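We are seeing segfaults after the change in pull request #12781 (merge commit 6f08eaf910 in the log above): https://github.com/open-mpi/ompi/pull/12781/files#diff-750d0e8be09c5f4ee5f703b8ba2c735a3e1b8b807162936e55530ec721ec5b86
The following command reproduces the crash: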
mpirun --wdir . -n 2 --mca pml ob1 openmpi-v5.0.6a1-v5.0.x-debug/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -d cuda D D
The backtrace is:
(gdb) bt
#0 0x00007fd46edddbe8 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1 0x00007fd46e5f3653 in opal_convertor_accelerator_memcpy (dest=0x7fd41755ce40, src=0x7fd43b200000, size=1, convertor=0x7ffedb4caf80) at opal_convertor.c:52
#2 0x00007fd46e5f3e93 in opal_convertor_pack (pConv=0x7ffedb4caf80, iov=0x7ffedb4cae70, out_size=0x7ffedb4cae84, max_data=0x7ffedb4cae88) at opal_convertor.c:284
#3 0x00007fd42114bb61 in mca_btl_sm_sendi (btl=0x7fd421350180 <mca_btl_sm>, endpoint=0x400c3830, convertor=0x7ffedb4caf80, header=0x7ffedb4cb0b0, header_size=16, payload_size=1,
order=255 '\377', flags=3, tag=65 'A', descriptor=0x0) at btl_sm_sendi.c:98
#4 0x00007fd4208e9c2d in mca_bml_base_sendi (bml_btl=0x7fd41c068540, convertor=0x7ffedb4caf80, header=0x7ffedb4cb0b0, header_size=16, payload_size=1, order=255 '\377', flags=3,
tag=65 'A', descriptor=0x0) at ../../../../ompi/mca/bml/bml.h:301
#5 0x00007fd4208eae09 in mca_pml_ob1_send_inline (buf=0x7fd43b200000, count=1, datatype=0x62ef80 <ompi_mpi_char>, dst=1, tag=100, seqn=2, dst_proc=0x40089a80, ob1_proc=0x3fbb9b40,
endpoint=0x400c5880, comm=0x62f980 <ompi_mpi_comm_world>) at pml_ob1_isend.c:125
#6 0x00007fd4208eaf62 in mca_pml_ob1_isend (buf=0x7fd43b200000, count=1, datatype=0x62ef80 <ompi_mpi_char>, dst=1, tag=100, sendmode=MCA_PML_BASE_SEND_STANDARD,
comm=0x62f980 <ompi_mpi_comm_world>, request=0x6310e0 <send_request>) at pml_ob1_isend.c:182
#7 0x00007fd46f550673 in PMPI_Isend (buf=0x7fd43b200000, count=1, type=0x62ef80 <ompi_mpi_char>, dest=1, tag=100, comm=0x62f980 <ompi_mpi_comm_world>, request=0x6310e0 <send_request>)
at isend.c:101
#8 0x000000000040304f in main (argc=<optimized out>, argv=<optimized out>) at osu_bibw.c:216
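We also get a segfault over the EFA network, but so far the issue appears to be in the CUDA memory copy path rather than in the transport. In frame #1, opal_convertor_accelerator_memcpy receives src=0x7fd43b200000, the same address as the buf passed to PMPI_Isend in frame #7 (presumably a device buffer, since osu_bibw was run with -d cuda D D), and frame #0 shows that address reaching a host memmove.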
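To make the suspected failure mode concrete (frames #0 and #1 show a host memmove reached with what is presumably a device address), here is a minimal sketch of the kind of guard a copy helper needs. This is not the Open MPI code. `guarded_copy`, `choose_copy_path`, and the `fake_*` stand-ins are hypothetical names invented for illustration, and the "device" region is ordinary host memory so that the sketch compiles without CUDA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callbacks. A real build would query the accelerator runtime
 * (for CUDA, something like cudaPointerGetAttributes) instead. */
typedef bool (*is_device_fn)(const void *ptr);
typedef int (*accel_copy_fn)(void *dest, const void *src, size_t size);

typedef enum { COPY_HOST, COPY_ACCELERATOR } copy_path_t;

/* Pick the copy routine that is safe for this (dest, src) pair:
 * if either side is device memory, a host memmove must not be used. */
static copy_path_t choose_copy_path(const void *dest, const void *src,
                                    is_device_fn is_device)
{
    assert(is_device != NULL);
    if (is_device(dest) || is_device(src)) {
        return COPY_ACCELERATOR;
    }
    return COPY_HOST;
}

/* Copy size bytes. Returns 0 on success, or -1 if a device buffer is
 * involved but no accelerator copy routine was supplied. */
static int guarded_copy(void *dest, const void *src, size_t size,
                        is_device_fn is_device, accel_copy_fn accel_copy)
{
    if (choose_copy_path(dest, src, is_device) == COPY_ACCELERATOR) {
        return accel_copy != NULL ? accel_copy(dest, src, size) : -1;
    }
    memmove(dest, src, size);
    return 0;
}

/* Stand-ins for illustration only: this region is plain host memory that
 * the fake classifier reports as "device" memory. */
static char fake_device_region[16];
static int fake_accel_calls = 0;

static bool fake_is_device(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)fake_device_region;
    return p >= base && p < base + sizeof(fake_device_region);
}

static int fake_accel_copy(void *dest, const void *src, size_t size)
{
    fake_accel_calls++;          /* record that the accelerator path ran */
    memmove(dest, src, size);    /* safe here only because the region is fake */
    return 0;
}
```

With these stand-ins, a host-to-host copy goes through memmove, while a copy whose source lies in `fake_device_region` is routed to `fake_accel_copy`, and returns -1 when no accelerator routine is supplied. The backtrace above looks like the case where that routing did not happen, although confirming it requires reading the changed code in opal_convertor.c.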