
OMPI-5.0 error in MPI_Finalize() with user's comm_delete_attr_fn calling MPI #12035

@jczhang07

Description

I met a problem while testing PETSc's error handler with Open MPI 5.0, using a single MPI rank. When PETSc detected an error (an integer multiplication overflow), it printed the stack trace and then called MPI_Finalize().

During program execution, PETSc duplicated an inner communicator from the outer MPI_COMM_WORLD. We linked the two communicators via attributes on each other and installed comm_delete_attr_fn callbacks on both (sketched below).
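For context, here is a minimal sketch of that linking scheme. The keyval names match the listing further down, but the body is my simplified assumption rather than PETSc's actual code (PETSc also installs a delete callback on the inner comm, and uses a union for the MPI_Comm/void* conversion instead of a cast):

#include <mpi.h>

static int Petsc_InnerComm_keyval = MPI_KEYVAL_INVALID; /* set on the outer comm */
static int Petsc_OuterComm_keyval = MPI_KEYVAL_INVALID; /* set on the inner comm */

static int inner_comm_delete_fn(MPI_Comm outer, int keyval, void *attr_val, void *extra_state)
{
  /* attr_val carries the inner communicator; unlink its back-pointer to the
     outer comm. In Open MPI an MPI_Comm is a pointer, so the cast works;
     PETSc itself goes through a union to stay portable. */
  MPI_Comm inner = (MPI_Comm)attr_val;
  return MPI_Comm_delete_attr(inner, Petsc_OuterComm_keyval);
}

void link_inner_comm(MPI_Comm outer)
{
  MPI_Comm inner;
  MPI_Comm_dup(outer, &inner);
  if (Petsc_InnerComm_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, inner_comm_delete_fn,
                           &Petsc_InnerComm_keyval, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &Petsc_OuterComm_keyval, NULL);
  }
  /* Cross-link the two communicators through their attributes. */
  MPI_Comm_set_attr(outer, Petsc_InnerComm_keyval, (void *)inner);
  MPI_Comm_set_attr(inner, Petsc_OuterComm_keyval, (void *)outer);
}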

I found that in MPI_Finalize(), ompi tried to delete the attributes on MPI_COMM_WORLD, which invoked Petsc_InnerComm_Attr_Delete_Fn(). There we had the attribute value (a pointer to the inner communicator) and called MPI functions such as MPI_Comm_get_attr() and MPI_Comm_delete_attr() to unlink the outer comm from the inner comm.
But at that point ompi_instance_count was already 0, so MPI_Comm_get_attr() failed the OMPI_ERR_INIT_FINALIZE check and dumped a misleading message such as

[frog:00000] *** An error occurred in MPI_Comm_get_attr
[frog:00000] *** reported by process [3339452416,0]
[frog:00000] *** on a NULL communicator
[frog:00000] *** Unknown error
[frog:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[frog:00000] ***    and MPI will try to terminate your MPI job as well)
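For what it's worth, the pattern reduces to a few lines. This is an illustrative sketch (the names are mine, not PETSc's) of an attribute on MPI_COMM_WORLD whose delete callback calls back into MPI:

#include <mpi.h>

static int keyval = MPI_KEYVAL_INVALID;

static int delete_fn(MPI_Comm comm, int kv, void *attr_val, void *extra_state)
{
  void *val;
  int   flag;
  /* Runs from ompi_attr_delete_all() inside MPI_Finalize(); by then
     ompi_instance_count is 0, so this call fails the OMPI_ERR_INIT_FINALIZE
     check and triggers the message above. */
  return MPI_Comm_get_attr(comm, kv, &val, &flag);
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &keyval, NULL);
  MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, NULL);
  MPI_Finalize(); /* deletes MPI_COMM_WORLD's attributes -> delete_fn -> error */
  return 0;
}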

Here is the gdb backtrace of my code, which failed at line 400:

(gdb) l
395         union
396         {
397           MPI_Comm comm;
398           void    *ptr;
399         } ocomm;
400         PetscCallMPI(MPI_Comm_get_attr(icomm.comm, Petsc_OuterComm_keyval, &ocomm, &flg));
401         if (!flg) SETERRMPI(PETSC_COMM_SELF, PETSC_ERR_ARG_CORRUPT, "Inner comm does not have OuterComm attribute");
402         if (ocomm.comm != comm) SETERRMPI(PETSC_COMM_SELF, PETSC_ERR_ARG_CORRUPT, "Inner comm's OuterComm attribute does not point to outer PETSc comm");
403       }
404       PetscCallMPI(MPI_Comm_delete_attr(icomm.comm, Petsc_OuterComm_keyval));
(gdb) bt
#0  Petsc_InnerComm_Attr_Delete_Fn (comm=0x7fffb5df81a0 <ompi_mpi_comm_world>, keyval=14, attr_val=0xa37ee0, extra_state=0x0) at /home/jczhang/petsc/src/sys/objects/pinit.c:400
#1  0x00007fffb5a5b239 in ompi_attr_delete_impl (type=COMM_ATTR, object=0x7fffb5df81a0 <ompi_mpi_comm_world>, attr_hash=0xb12470, key=14, predefined=true) at attribute/attribute.c:1139
#2  0x00007fffb5a5b81d in ompi_attr_delete_all (type=COMM_ATTR, object=0x7fffb5df81a0 <ompi_mpi_comm_world>, attr_hash=0xb12470) at attribute/attribute.c:1243
#3  0x00007fffb5a5d616 in ompi_comm_finalize () at communicator/comm_init.c:319
#4  0x00007fffe7136b3d in opal_finalize_cleanup_domain (domain=0x7fffe71fade0 <opal_init_domain>) at runtime/opal_finalize_core.c:128
#5  0x00007fffe7129c08 in opal_finalize () at runtime/opal_finalize.c:56
#6  0x00007fffb5a92e0b in ompi_rte_finalize () at runtime/ompi_rte.c:1041
#7  0x00007fffb5a9aec3 in ompi_mpi_instance_finalize_common () at instance/instance.c:911
#8  0x00007fffb5a9b0c9 in ompi_mpi_instance_finalize (instance=0x7fffb5e092c0 <ompi_mpi_instance_default>) at instance/instance.c:965
#9  0x00007fffb5a8de27 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#10 0x00007fffb5acf918 in PMPI_Finalize () at finalize.c:52
#11 0x00007ffff196f2c5 in PetscError (comm=0x7fffb5df83a0 <ompi_mpi_comm_self>, line=109, func=0x40b167 <__func__.4> "main", file=0x40b051 "ex19.c", n=PETSC_ERR_SUP, p=PETSC_ERROR_REPEAT, mess=0x40b058 " ")
    at /home/jczhang/petsc/src/sys/error/err.c:417
#12 0x0000000000402946 in main (argc=5, argv=0x7fffffffd7a8) at ex19.c:109

I checked the MPI standard, and it does not say that users must not call MPI functions in attribute delete callbacks.
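The only portable guard I can think of is something like the sketch below (my assumption, not a recommendation from the standard or from Open MPI), and even it would not help here, since under a strict reading MPI_Finalized() is still false while MPI_Finalize() is in progress:

#include <mpi.h>

static int guarded_delete_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra_state)
{
  int finalized;
  MPI_Finalized(&finalized); /* one of the few calls allowed at any time */
  if (finalized) return MPI_SUCCESS; /* too late to call other MPI functions */
  /* ... the usual MPI_Comm_get_attr()/MPI_Comm_delete_attr() unlinking ... */
  return MPI_SUCCESS;
}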
