@tkordenbrock
Member

This PR adds a timeout to each fragment of a rendezvous get. If any fragment times out or fails, the entire receive fails.

Signed-off-by: Todd Kordenbrock [email protected]

plesn and others added 4 commits July 9, 2017 22:12
If a frag cannot be retried because the ni_fail_type is other than
PTL_NI_DROPPED, then set the return type and jump to callback_error.
This sets MPI_ERROR and completes the receive.

Signed-off-by: Todd Kordenbrock <[email protected]>
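
The retry decision described in the commit message above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the `sketch_*` enums and function are hypothetical stand-ins (the real failure codes such as PTL_NI_DROPPED come from the Portals4 headers), and the real code jumps to callback_error rather than returning a value.

```c
/* Hypothetical stand-ins for the Portals4 ni_fail_type values. */
enum sketch_ni_fail {
    SKETCH_NI_OK = 0,
    SKETCH_NI_DROPPED,
    SKETCH_NI_OTHER
};

/* What the receive path should do with a completed fragment. */
enum sketch_action {
    SKETCH_FRAG_DONE,   /* fragment arrived, nothing more to do */
    SKETCH_FRAG_RETRY,  /* fragment was dropped, reissue the get */
    SKETCH_RECV_ERROR   /* not retryable: fail the whole receive */
};

/* Only a dropped fragment is retried; any other failure ends the
 * receive with an error (the PR sets MPI_ERROR at that point). */
static enum sketch_action sketch_classify_frag(enum sketch_ni_fail fail_type)
{
    if (SKETCH_NI_OK == fail_type) {
        return SKETCH_FRAG_DONE;
    }
    if (SKETCH_NI_DROPPED == fail_type) {
        return SKETCH_FRAG_RETRY;
    }
    return SKETCH_RECV_ERROR;
}
```
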
Rearrange the receive frag timeout logic to avoid calling
opal_timer_base_get_usec() in read_msg().  Instead, record the frag's
start time at the first retry.

Signed-off-by: Todd Kordenbrock <[email protected]>
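
A minimal sketch of the "start the clock at the first retry" idea from the commit message above, under stated assumptions: `struct sketch_frag`, `sketch_frag_may_retry()`, and the `timeout_usec` argument are hypothetical names, and `now_usec` stands in for whatever opal_timer_base_get_usec() would return. The point is that the common path, where no fragment is ever retried, never reads the timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-fragment state; not the struct used in the PR. */
struct sketch_frag {
    bool     timing;      /* false until the first retry */
    uint64_t start_usec;  /* time of the first retry */
};

/* Called only when a fragment needs to be retried.
 * Returns true if the retry may proceed, false if the fragment has
 * timed out (in which case the whole receive would fail). */
static bool sketch_frag_may_retry(struct sketch_frag *frag,
                                  uint64_t now_usec,
                                  uint64_t timeout_usec)
{
    if (!frag->timing) {
        /* First retry: start the clock here rather than when the
         * get was first issued. */
        frag->timing = true;
        frag->start_usec = now_usec;
        return true;
    }
    return (now_usec - frag->start_usec) < timeout_usec;
}
```

One consequence of this design is that the timeout measures time since the first retry, not since the get was originally posted.
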
@hppritcha
Member

@regrant is this an enhancement? Adding a new MCA parameter to releases that are now in bug-fix-only mode (2.0.x and 2.x) makes accepting this PR into those releases problematic.

@regrant
Contributor

regrant commented Jul 13, 2017

@hppritcha I agree this should go into 3.X, not the 2.X series. It makes a previous bug fix more efficient, so it is an enhancement rather than a bug fix.

Contributor

@regrant regrant left a comment

This code is good to go, tested and approved.

@bwbarrett
Member

This looks like a PR for master. You should probably remove the target:3.0 tag and file a new PR for the merge to 3.0 once this is in master (anyone can approve the pull into master, so that's all you...).

@regrant
Contributor

regrant commented Jul 13, 2017

@bwbarrett thanks, I missed the master tag; will do.

@regrant regrant merged commit 0ce8590 into open-mpi:master Jul 13, 2017
@tkordenbrock tkordenbrock deleted the topic/master/get.retry.timeout branch July 19, 2017 14:56