Background information
This is related to #10158. I am opening a separate issue in the hope of providing a better example of the error I am seeing.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
source
Please describe the system on which you are running
- Operating system/version: AlmaLinux release 8.5 (Arctic Sphynx)
- Computer hardware: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 12 cores, 125G
- Network type: Infiniband
Details of the problem
As recommended by Howard on the mailing list, I tried to run the ping-pong-mpi-tcp project. When I run the program in a similar fashion to mpi.sh in the repository, I receive a segmentation fault. The error occurs in StringBuilder.append(String). The JVM error report is attached here: hs_err_pid31983.log. The application log looks like this:
2022-04-05 16:50:33 [main] MPIMain.main()
INFO: Args received: [2, false]
2022-04-05 16:50:33 [main] MPIMain.main()
INFO: Args received: [2, false]
[INFO] 16:50:34:294 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 16:50:34:294 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 16:50:34:298 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 16:50:34:298 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 16:50:35:494 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=false}
[INFO] 16:50:35:503 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}
[INFO] 16:50:35:540 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 16:50:35:540 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 16:50:35:541 role.Node.<init>(): Node created: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}
[INFO] 16:50:35:541 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=true}]
[INFO] 16:50:35:541 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}]
[INFO] 16:50:35:542 role.Node.<init>(): Node created: Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}
[node500:31983:0:31990] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14)
[INFO] 16:50:36:566 testframework.TestFramework._doPingTests(): Starting ping-pong tests...
==== backtrace (tid:  31990) ====
 0  /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x1512416f32a4]
 1  /usr/lib64/libucs.so.0(+0x2347c) [0x1512416f347c]
 2  /usr/lib64/libucs.so.0(+0x2364a) [0x1512416f364a]
 3  [0x1512949256d4]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00001512949256d4 (sent by kill), pid=31983, tid=31990
#
# JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
# Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# J 448 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@17.0.2 (8 bytes) @ 0x00001512949256d4 [0x00001512949256a0+0x0000000000000034]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/core.31983)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/hs_err_pid31983.log
Compiled method (c2)    3781  448       4       java.lang.StringBuilder::append (8 bytes)
 total in heap  [0x0000151294925510,0x0000151294925fb0] = 2720
 relocation     [0x0000151294925670,0x00001512949256a0] = 48
 main code      [0x00001512949256a0,0x0000151294925be0] = 1344
 stub code      [0x0000151294925be0,0x0000151294925bf8] = 24
 metadata       [0x0000151294925bf8,0x0000151294925c50] = 88
 scopes data    [0x0000151294925c50,0x0000151294925e88] = 568
 scopes pcs     [0x0000151294925e88,0x0000151294925f68] = 224
 dependencies   [0x0000151294925f68,0x0000151294925f70] = 8
 handler table  [0x0000151294925f70,0x0000151294925f88] = 24
 nul chk table  [0x0000151294925f88,0x0000151294925fb0] = 40
[node500:31983] *** Process received signal ***
[node500:31983] Signal: Aborted (6)
[node500:31983] Signal code:  (-6)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
[node500:31983] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x1512aa492c20]
[node500:31983] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x1512a9eee37f]
[node500:31983] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x1512a9ed8db5]
[node500:31983] [ 3] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0x246cc9)[0x1512a8e90cc9]
[node500:31983] [ 4] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0e70c)[0x1512a9a5870c]
[node500:31983] [ 5] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0f12b)[0x1512a9a5912b]
[node500:31983] [ 6] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0f15e)[0x1512a9a5915e]
[node500:31983] [ 7] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x198)[0x1512a9906148]
[node500:31983] [ 8] /usr/lib64/libpthread.so.0(+0x12c20)[0x1512aa492c20]
[node500:31983] [ 9] [0x1512949256d4]
[node500:31983] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node500 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
The program was started with the following commands:
OMPI="/homes2/ils/laudan/ompi-java17/bin/mpirun"
JAVA="/homes2/ils/laudan/jdk-17.0.2/bin/java"
$OMPI --mca pml ucx -np 2\
 $JAVA -cp ping-pong-mpi-tcp-1.0-SNAPSHOT-jar-with-dependencies.jar MPIMain 2 false
Any help would be much appreciated.