Description
I ran into a problem while testing PETSc's error handler with Open MPI 5.0, using one MPI rank. When PETSc detected an error (an integer multiplication overflow), it printed the stack trace and then called MPI_Finalize().
During program execution, PETSc dup'ed an inner communicator from the outer MPI_COMM_WORLD. We linked the two communicators via attributes on each and set up comm_delete_attr_fn's on them, roughly as in the sketch below.
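For context, this is roughly how the linkage is set up (a simplified sketch, not the actual PETSc source; the keyval names and Petsc_InnerComm_Attr_Delete_Fn match the backtrace below, while Petsc_OuterComm_Attr_Delete_Fn and the helper function are just illustrative):

#include <mpi.h>
#include <stddef.h>

static int Petsc_InnerComm_keyval = MPI_KEYVAL_INVALID;
static int Petsc_OuterComm_keyval = MPI_KEYVAL_INVALID;

/* Delete callbacks elided here; the real Petsc_InnerComm_Attr_Delete_Fn is
   shown in the gdb listing further down. */
static int Petsc_InnerComm_Attr_Delete_Fn(MPI_Comm comm, int keyval, void *attr_val, void *extra_state) { return MPI_SUCCESS; }
static int Petsc_OuterComm_Attr_Delete_Fn(MPI_Comm comm, int keyval, void *attr_val, void *extra_state) { return MPI_SUCCESS; }

static void LinkComms(MPI_Comm outer, MPI_Comm *inner_out)
{
  union { MPI_Comm comm; void *ptr; } attr;
  MPI_Comm inner;

  /* Each keyval carries a comm_delete_attr_fn that the MPI library invokes
     when the attribute is deleted, including during MPI_Finalize(). */
  if (Petsc_InnerComm_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, Petsc_InnerComm_Attr_Delete_Fn, &Petsc_InnerComm_keyval, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, Petsc_OuterComm_Attr_Delete_Fn, &Petsc_OuterComm_keyval, NULL);
  }

  MPI_Comm_dup(outer, &inner);                                 /* inner comm used internally by PETSc */
  attr.comm = inner;
  MPI_Comm_set_attr(outer, Petsc_InnerComm_keyval, attr.ptr);  /* outer -> inner */
  attr.comm = outer;
  MPI_Comm_set_attr(inner, Petsc_OuterComm_keyval, attr.ptr);  /* inner -> outer */
  *inner_out = inner;
}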
I found that in MPI_Finalize(), Open MPI tried to delete the attributes on MPI_COMM_WORLD, which invoked Petsc_InnerComm_Attr_Delete_Fn(). There we read the attribute value (a pointer to the inner communicator) and called MPI functions such as MPI_Comm_get_attr() and MPI_Comm_delete_attr() to unlink the outer comm from the inner comm.
But at that point ompi_instance_count was already 0, so MPI_Comm_get_attr() failed the OMPI_ERR_INIT_FINALIZE check and dumped a misleading message like the one below (a minimal reproducer sketch follows the output):
[frog:00000] *** An error occurred in MPI_Comm_get_attr
[frog:00000] *** reported by process [3339452416,0]
[frog:00000] *** on a NULL communicator
[frog:00000] *** Unknown error
[frog:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[frog:00000] *** and MPI will try to terminate your MPI job as well)
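For reference, here is a minimal standalone sketch (untested, with made-up names) of the pattern I believe hits this path: an attribute on MPI_COMM_WORLD whose delete callback calls back into MPI, with the callback being run from inside MPI_Finalize():

#include <mpi.h>

static int delete_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra_state)
{
  void *val;
  int   flag;
  /* Calling MPI from the delete callback; if this runs after
     ompi_instance_count has dropped to 0, MPI_Comm_get_attr() fails the
     OMPI_ERR_INIT_FINALIZE check and aborts with the message above. */
  MPI_Comm_get_attr(comm, keyval, &val, &flag);
  return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
  int keyval;
  MPI_Init(&argc, &argv);
  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fn, &keyval, NULL);
  MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, NULL);
  MPI_Finalize();   /* deleting the attribute on MPI_COMM_WORLD invokes delete_fn */
  return 0;
}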
This is the gdb session from my actual code, which failed at line 400:
(gdb) l
395 union
396 {
397 MPI_Comm comm;
398 void *ptr;
399 } ocomm;
400 PetscCallMPI(MPI_Comm_get_attr(icomm.comm, Petsc_OuterComm_keyval, &ocomm, &flg));
401 if (!flg) SETERRMPI(PETSC_COMM_SELF, PETSC_ERR_ARG_CORRUPT, "Inner comm does not have OuterComm attribute");
402 if (ocomm.comm != comm) SETERRMPI(PETSC_COMM_SELF, PETSC_ERR_ARG_CORRUPT, "Inner comm's OuterComm attribute does not point to outer PETSc comm");
403 }
404 PetscCallMPI(MPI_Comm_delete_attr(icomm.comm, Petsc_OuterComm_keyval));
(gdb) bt
#0 Petsc_InnerComm_Attr_Delete_Fn (comm=0x7fffb5df81a0 <ompi_mpi_comm_world>, keyval=14, attr_val=0xa37ee0, extra_state=0x0) at /home/jczhang/petsc/src/sys/objects/pinit.c:400
#1 0x00007fffb5a5b239 in ompi_attr_delete_impl (type=COMM_ATTR, object=0x7fffb5df81a0 <ompi_mpi_comm_world>, attr_hash=0xb12470, key=14, predefined=true) at attribute/attribute.c:1139
#2 0x00007fffb5a5b81d in ompi_attr_delete_all (type=COMM_ATTR, object=0x7fffb5df81a0 <ompi_mpi_comm_world>, attr_hash=0xb12470) at attribute/attribute.c:1243
#3 0x00007fffb5a5d616 in ompi_comm_finalize () at communicator/comm_init.c:319
#4 0x00007fffe7136b3d in opal_finalize_cleanup_domain (domain=0x7fffe71fade0 <opal_init_domain>) at runtime/opal_finalize_core.c:128
#5 0x00007fffe7129c08 in opal_finalize () at runtime/opal_finalize.c:56
#6 0x00007fffb5a92e0b in ompi_rte_finalize () at runtime/ompi_rte.c:1041
#7 0x00007fffb5a9aec3 in ompi_mpi_instance_finalize_common () at instance/instance.c:911
#8 0x00007fffb5a9b0c9 in ompi_mpi_instance_finalize (instance=0x7fffb5e092c0 <ompi_mpi_instance_default>) at instance/instance.c:965
#9 0x00007fffb5a8de27 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#10 0x00007fffb5acf918 in PMPI_Finalize () at finalize.c:52
#11 0x00007ffff196f2c5 in PetscError (comm=0x7fffb5df83a0 <ompi_mpi_comm_self>, line=109, func=0x40b167 <__func__.4> "main", file=0x40b051 "ex19.c", n=PETSC_ERR_SUP, p=PETSC_ERROR_REPEAT, mess=0x40b058 " ")
at /home/jczhang/petsc/src/sys/error/err.c:417
#12 0x0000000000402946 in main (argc=5, argv=0x7fffffffd7a8) at ex19.c:109
I checked the MPI standard and did not find anything saying users must not call MPI functions inside attribute delete callbacks.