Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Feb 12, 2025

Fixes one of the remaining issues I see on our hardware. With these commits it runs cleanly with RMA-MT.

Note the second commit is not really a functional change but a cleanup change.

This commit fixes a regression I introduced when changing how the endpoint flags
are set. There is no guarantee when the remote completion message comes in that
the local endpoint has been connected (only the remote endpoint). Instead of
always setting the endpoint as ready here the code now only sets that the remote
is connected and then sets the endpoint as ready only if both the local and
remote endpoints are connected. When the local endpoint is connected it
similarly checks that the remote is connected before setting the endpoint as
ready.

This was verified using RMA-MT with 128 threads and it does fix the race.

Signed-off-by: Nathan Hjelm <[email protected]>
Connection endpoints are created to assist with connection connect-to-endpoint
UCT endpoints. In most situations it is created and destroyed for each thread
context but it may be used by multiple simultaneous threads. The code retains
and releases the endpoint at multiple points where needed. This is overkill.
This commit updates the code to retain once per connecting thread (or the
original creation) and then releases when the connection is complete.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn hjelmn merged commit 9fb2d4e into open-mpi:main Feb 12, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants