Error in MPI_Error_string after failed call to MPI_T_pvar_get_info #7084

@devreal

Description

I'm trying to run the varlist tool developed by LLNL, which is supposed to list all available MPI_T variables. The code is available at https://github.com/LLNL/mpi-tools.

Running the tool ends in an error:

Found 20 performance variables
Found 20 performance variables with verbosity <= D/A-9

Variable                        VRB   Class   Type   Bind     R/O CNT ATM
-------------------------------------------------------------------------
mpool_hugepage_bytes_allocated  U/A-3 SIZE    ULONG  n/a      YES YES  NO
ERROR: PVARINFO: MPI error code -18: 

Thread 1 "varlist" hit Breakpoint 1, PMPI_Error_string (errorcode=-18, string=0x55555575a080 <errMsg> "MPI_ERR_OTHER: known error not in list", 
    resultlen=0x55555575a468 <errMsgLen>) at perror_string.c:44
44          OPAL_CR_NOOP_PROGRESS();
(gdb) bt
#0  PMPI_Error_string (errorcode=-18, string=0x55555575a080 <errMsg> "MPI_ERR_OTHER: known error not in list", resultlen=0x55555575a468 <errMsgLen>) at perror_string.c:44
#1  0x00005555555562a1 in list_pvars () at /home/joseph/src/mpi-tools/mpi_t/varlist/varlist.c:410
#2  0x0000555555557ca3 in main (argc=1, argv=0x7fffffffdcf8) at /home/joseph/src/mpi-tools/mpi_t/varlist/varlist.c:899
(gdb) f 1
#1  0x00005555555562a1 in list_pvars () at /home/joseph/src/mpi-tools/mpi_t/varlist/varlist.c:410
410                     CHECKERR("PVARINFO",err);
(gdb) print err
$4 = -18
(gdb) list
405             for (i=0; i<num; i++)
406             {
407                     namelen=maxnamelen;
408                     desclen=maxdesclen;
409                     err=MPI_T_pvar_get_info(i,name,&namelen,&verbos,&vc,&dt,&et,desc,&desclen,&bind,&ro,&ct,&at);
410                     CHECKERR("PVARINFO",err);
411                     if (verbos<=verbosity)
412                     {
413                             if (!longlist)
414                             {
(gdb) print name
$5 = 0x55555593a630 "mpool_hugepage_bytes_allocated"
(gdb) c
Continuing.
[beryl:18517] *** An error occurred in MPI_Error_string
[beryl:18517] *** reported by process [3787063297,0]
[beryl:18517] *** on communicator MPI_COMM_WORLD
[beryl:18517] *** MPI_ERR_ARG: invalid argument of some other kind
[beryl:18517] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[beryl:18517] ***    and potentially your MPI job)
[Thread 0x7fffed921700 (LWP 18523) exited]
[Thread 0x7ffff4f6c700 (LWP 18522) exited]
[Inferior 1 (process 18517) exited with code 015]

The definition of the CHECKERR macro is:

#define CHECKERR(errstr,err) if (err!=MPI_SUCCESS) { printf("ERROR: %s: MPI error code %i: \n",errstr,err); MPI_Error_string(err, errMsg, &errMsgLen); errMsg[errMsgLen]=0; printf("%s\n", errMsg); /*usage(1);*/ }

The macro checks the return value of the preceding call (here MPI_T_pvar_get_info) and, on failure, passes that value to MPI_Error_string. The value returned here is -18. The MPI_Error_string call itself then fails with MPI_ERR_ARG, which is fatal inside Open MPI because MPI_ERRORS_ARE_FATAL is in effect.
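As a possible tool-side workaround, the error lookup could be made defensive, so that an unrecognized code yields a fallback message instead of relying on the lookup's outputs. The following is only a sketch under stated assumptions: `lookup_error_string` is a hypothetical stand-in for MPI_Error_string (so the example builds without MPI), and `describe_error` is a hypothetical helper, not part of varlist or Open MPI. It also assumes the lookup returns an error to the caller rather than aborting, which would not hold while MPI_ERRORS_ARE_FATAL is in effect, as in the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for MPI_Error_string so the sketch builds without MPI.
 * Returns 0 and fills buf/len for codes it knows; returns nonzero and leaves
 * the outputs untouched otherwise. Like the real call, it assumes buf is
 * large enough for any message it can produce. */
static int lookup_error_string(int code, char *buf, int *len)
{
    static const char *const known[] = {
        "no error", "invalid buffer", "invalid count"
    };
    const int nknown = (int)(sizeof known / sizeof known[0]);

    if (code < 0 || code >= nknown) {
        return 1;
    }
    strcpy(buf, known[code]);
    *len = (int)strlen(known[code]);
    return 0;
}

/* Hypothetical helper: write a description of code into buf, which is always
 * NUL-terminated on return. Returns 1 if the lookup succeeded and 0 if a
 * fallback message was written instead. */
static int describe_error(int code, char *buf, size_t bufsize)
{
    int len = -1; /* sentinel: stays -1 if the lookup never sets it */

    assert(buf != NULL && bufsize > 0);
    buf[0] = '\0';

    /* Trust the outputs only if the lookup reported success and the
     * reported length fits inside the buffer. */
    if (lookup_error_string(code, buf, &len) != 0 ||
        len < 0 || (size_t)len >= bufsize) {
        snprintf(buf, bufsize, "unrecognized error code %d", code);
        return 0;
    }
    buf[len] = '\0';
    return 1;
}
```

Unlike the macro above, this helper never indexes the buffer with a length that the lookup may have left unset. It does not address the underlying problem that MPI_T_pvar_get_info returns a value MPI_Error_string rejects.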

I also ran the varlist tool with MPICH 3.2.1, where it runs fine (although it reports that there are no performance variables in MPICH).

I tested Open MPI's v4.0.x and master branches; both show the same error.
