@kawashima-fj

cherry picked from commit 0021616

Data transferred by `MPI_BSEND` may be corrupted if all of the following
conditions are met.

- The message size does not exceed the eager limit.
- The `btl_alloc` function in the BTL interface returns `NULL`
  for some reason.
- The MPI program overwrites the send buffer after `MPI_BSEND`
  returns.

The problem lies in how the ob1 PML handles a pending send request.
The `mca_pml_ob1_send_request_start_copy` function returns
`OMPI_ERR_OUT_OF_RESOURCE` if the `mca_bml_base_alloc` function returns
`des = NULL`. In this case, the send request is added to the
`send_pending` list and `MPI_BSEND` returns immediately. Because the
pending request still refers to the user buffer, the buffer may have
been overwritten by the MPI program by the time the
`mca_pml_ob1_send_request_start_copy` function retries the send.
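
The failure mode can be illustrated with a small stand-alone model. This is only a sketch and does not use Open MPI code: the `toy_*` names are hypothetical, and the "pending list" is reduced to a single request that keeps a pointer to the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a queued send request; not the ob1 structure. */
typedef struct {
    const char *user_buf;  /* points into the caller's buffer, no copy made */
    size_t len;
} toy_pending_send;

/* Models the failing path: allocation failed, so only the pointer is queued. */
static toy_pending_send toy_bsend_defer(const char *buf, size_t len)
{
    toy_pending_send req = { buf, len };
    return req;
}

/* Models the later retry: the payload is read from the user buffer now. */
static void toy_progress(const toy_pending_send *req, char *wire, size_t wire_size)
{
    assert(req->len <= wire_size);
    memcpy(wire, req->user_buf, req->len);
}

/* Returns 1 if the delivered payload differs from what the caller sent. */
static int toy_demo_corruption(void)
{
    char user_buf[8] = "first";
    char wire[8] = {0};

    toy_pending_send req = toy_bsend_defer(user_buf, sizeof user_buf);

    /* The buffered send has "returned", so the caller reuses its buffer. */
    strcpy(user_buf, "second");

    /* The deferred send runs only now and picks up the new contents. */
    toy_progress(&req, wire, sizeof wire);

    return strcmp(wire, "first") != 0;
}
```

In this model `toy_demo_corruption()` reports that the receiver sees `"second"` instead of `"first"`, which mirrors the corruption described above.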

Call hierarchy of `MPI_BSEND`:

```
  MPI_Bsend
    mca_pml_ob1_send
      if (MCA_PML_BASE_SEND_BUFFERED == sendmode)
        mca_pml_ob1_isend
          MCA_PML_OB1_SEND_REQUEST_START_W_SEQ
            mca_pml_ob1_send_request_start_seq
              mca_pml_ob1_send_request_start_btl
                if (size <= eager_limit)
                  if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED)
                    mca_pml_ob1_send_request_start_copy
                      mca_bml_base_alloc
                        btl_alloc
              if (OMPI_ERR_OUT_OF_RESOURCE == rc)
                add_request_to_send_pending
        ompi_request_free
```
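
The branches in this hierarchy that matter for the bug can be condensed into one decision function. This is a simplified sketch, not the ob1 code: `toy_start_buffered` and the `TOY_*` outcomes are hypothetical names, and the path for messages above the eager limit is deliberately left unmodeled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    TOY_SENT_BY_COPY,    /* descriptor allocated, data copied and sent right away */
    TOY_QUEUED_PENDING,  /* allocation failed, request parked on the pending list */
    TOY_OTHER_PATH       /* message above the eager limit; not modeled here */
} toy_bsend_outcome;

/* Models what happens to a buffered-mode send of `size` bytes.
 * `alloc_succeeds` stands in for btl_alloc returning a non-NULL descriptor. */
static toy_bsend_outcome toy_start_buffered(size_t size, size_t eager_limit,
                                            int alloc_succeeds)
{
    if (size > eager_limit) {
        return TOY_OTHER_PATH;
    }
    if (!alloc_succeeds) {
        /* Corresponds to OMPI_ERR_OUT_OF_RESOURCE in the hierarchy above. */
        return TOY_QUEUED_PENDING;
    }
    return TOY_SENT_BY_COPY;
}
```

Only the `TOY_QUEUED_PENDING` outcome is affected: the send is deferred while `MPI_BSEND` has already returned to the caller.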

To solve this problem, we should save the data to the buffer
attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`.
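
The idea of the fix can be sketched in the same stand-alone style. Again this is an illustration under assumptions, not the actual patch: `toy_attached_buffer` is a hypothetical bump allocator standing in for the buffer attached by `MPI_BUFFER_ATTACH`, and the other `toy_*` names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer attached with MPI_BUFFER_ATTACH. */
typedef struct {
    char *base;
    size_t size;
    size_t used;
} toy_attached_buffer;

/* Hypothetical pending request that owns a private copy of the payload. */
typedef struct {
    const char *data;  /* points into the attached buffer, not user memory */
    size_t len;
} toy_safe_pending_send;

/* Copies the payload into the attached buffer before the request is queued.
 * Returns 0 on success, -1 if the attached buffer has no room left. */
static int toy_bsend_defer_with_copy(toy_attached_buffer *ab, const char *buf,
                                     size_t len, toy_safe_pending_send *req)
{
    if (len > ab->size - ab->used) {
        return -1;
    }
    char *slot = ab->base + ab->used;
    memcpy(slot, buf, len);
    ab->used += len;
    req->data = slot;
    req->len = len;
    return 0;
}

/* Returns 1 if the original payload survives an overwrite of the user buffer. */
static int toy_demo_fix(void)
{
    char storage[64];
    toy_attached_buffer ab = { storage, sizeof storage, 0 };
    char user_buf[8] = "first";
    char wire[8] = {0};
    toy_safe_pending_send req;

    if (toy_bsend_defer_with_copy(&ab, user_buf, sizeof user_buf, &req) != 0) {
        return 0;
    }

    /* The caller reuses its buffer after the buffered send returns. */
    strcpy(user_buf, "second");

    /* The deferred send reads from the private copy instead. */
    memcpy(wire, req.data, req.len);

    return strcmp(wire, "first") == 0;
}
```

Because the copy is taken before the call returns, a later overwrite of the user buffer no longer affects what the deferred send transmits.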

This problem was introduced by an ob1 optimization (commits 2b57f42
and a06e491) in the v1.8 series.

Signed-off-by: KAWASHIMA Takahiro <[email protected]>
(cherry picked from commit 0021616)
@kawashima-fj kawashima-fj added this to the v3.1.2 milestone Jul 17, 2018
@kawashima-fj kawashima-fj requested review from bosilca and removed request for bosilca July 17, 2018 06:02
@kawashima-fj

Mellanox CI failed to fetch repo.
bot:mellanox:retest

@jsquyres

@bosilca Can you review?

@jsquyres

jsquyres commented Aug 7, 2018

@bosilca You reviewed #5437, which is the v3.0.x version of this PR. Can you review this one?


@bosilca bosilca left a comment

👍

@jsquyres

jsquyres commented Aug 7, 2018

@bwbarrett Good to go.

@bwbarrett bwbarrett merged commit 07fca59 into open-mpi:v3.1.x Aug 7, 2018
@kawashima-fj kawashima-fj deleted the pr/v3.1.x/pending-bsend-fix branch August 17, 2018 00:18