part: multithreading deadlock fixes and safety checks #12935
This PR fixes multiple deadlocks and issues encountered with partitioned communications.
The first deadlocks occur when one thread is in `opal_progress` and others are working on the partitioned request:

- `MPI_Pready` could change the `req->flags` of a partition while the progress thread is testing it, leading to an edge case where `req->done_count` would be greater than the number of partitions.
- The changes made by `MPI_Pready` could be overwritten by the progress thread.

Both were fixed by adding the array `req->part_ready`, in which `MPI_Pready` marks the partitions that are ready to be sent. This prevents the progress engine from touching the state of a partition as long as it isn't ready.
Since no atomic operations were added, this should have little to no impact on performance.
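Roughly, the idea looks like the sketch below (the struct layout and helper names are simplified illustrations, not the actual `mca_part_persist` internals; only `part_ready`, `flags`, and `done_count` come from the change itself): `MPI_Pready` only marks its partition in `part_ready`, and the progress engine skips any partition that has not been marked, so the two sides never touch the same per-partition state at the same time.

```c
/* Illustrative sketch only -- a simplified stand-in for the real
 * mca_part_persist request structure and progress loop. */
#include <stdint.h>

typedef struct {
    int32_t  parts;        /* number of partitions                          */
    int32_t  done_count;   /* partitions fully processed by progress        */
    int32_t *flags;        /* per-partition internal state (progress-owned) */
    int32_t *part_ready;   /* set by MPI_Pready, read by the progress loop  */
} part_request_t;

/* Called from the application thread (MPI_Pready). It only marks the
 * partition as ready; it no longer touches req->flags, which stays
 * owned by the progress engine. */
static void pready_mark(part_request_t *req, int partition)
{
    req->part_ready[partition] = 1;
}

/* Called from the progress engine. Partitions that were not marked
 * ready are skipped entirely, so their state cannot be modified while
 * another thread is still setting them up. */
static void progress_one_request(part_request_t *req)
{
    for (int32_t p = 0; p < req->parts; ++p) {
        if (0 == req->part_ready[p]) {
            continue;              /* not ready: do not touch its state */
        }
        if (0 == req->flags[p]) {
            /* ... start/complete the transfer for this partition ... */
            req->flags[p] = 1;
            req->done_count++;     /* can no longer exceed req->parts   */
        }
    }
}
```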
I fixed another deadlock that I rarely encountered at the initialization of the part module: two `ompi_comm_idup` calls need to be done; both are started in the progress engine and prevent progressing partitioned requests until they are done. Sometimes the second `ompi_comm_idup` would be marked as completed in one rank but not in the other, leading to a deadlock. This was fixed by doing one `ompi_comm_idup` at a time, with the side effect of slowing down the initialization (the first request must be done before starting the next `ompi_comm_idup`).
A very rare segfault caused by calling `mca_part_persist_free_req` while `ompi_part_persist.lock` was unlocked was also fixed.
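A minimal sketch of that locking fix, using a plain pthread mutex for illustration in place of `ompi_part_persist.lock` (the function bodies are placeholders): the request is only freed while the lock is held.

```c
#include <pthread.h>

/* Illustrative: stands in for ompi_part_persist.lock and
 * mca_part_persist_free_req(); names and types are simplified. */
static pthread_mutex_t part_persist_lock = PTHREAD_MUTEX_INITIALIZER;

static void part_persist_free_req(void *req)
{
    (void)req;
    /* ... unlink the request from the module's list and release it ... */
}

static void free_request_safely(void *req)
{
    /* The fix: the request is only freed while the module lock is held,
     * so the progress thread cannot walk the request list concurrently. */
    pthread_mutex_lock(&part_persist_lock);
    part_persist_free_req(req);
    pthread_mutex_unlock(&part_persist_lock);
}
```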
I added many error checks to `mca_part_persist_progress`, ensuring no new deadlock can occur when an internal function fails.
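One way such checks can avoid a hang is sketched below (the helper names and error constant are illustrative, not the module's code): every internal call made from the progress path is checked, and on failure the request is marked errored and completed instead of being left in an intermediate state.

```c
#define SKETCH_SUCCESS 0

/* Hypothetical internal step; stands in for the calls made inside the
 * progress function (starting transfers, testing requests, ...). */
static int progress_one_step(void *req) { (void)req; return SKETCH_SUCCESS; }

static void mark_request_errored_and_complete(void *req, int rc)
{
    (void)req; (void)rc;
    /* ... record the error on the request and complete it, so threads
     *     waiting or testing on it return instead of spinning forever ... */
}

static void progress_request(void *req)
{
    int rc = progress_one_step(req);
    if (SKETCH_SUCCESS != rc) {
        /* The added checks: a failing internal function no longer leaves
         * the request stuck and the waiters deadlocked. */
        mark_request_errored_and_complete(req, rc);
    }
}
```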
For testing, I used a two-way ring exchange with a fixed number of partitions distributed among multiple OpenMP threads. This was my original use case, for which I encountered all those issues, since all threads need to call `MPI_Pready` and `MPI_Parrived`. Using up to 128 cores with varying numbers of processes and threads showed no new deadlocks or communication issues.
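For reference, a trimmed-down sketch of that kind of test (only one direction of the ring is shown, buffer handling is elided, and the partition count, message size, and tag are arbitrary choices rather than the benchmark's actual parameters): the partitions are split across OpenMP threads, and each thread calls `MPI_Pready` for its send partitions and polls `MPI_Parrived` for the matching receive partitions.

```c
#include <mpi.h>
#include <stdlib.h>

#define PARTITIONS 64
#define COUNT      1024            /* elements per partition */

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    double *sendbuf = malloc(sizeof(double) * PARTITIONS * COUNT);
    double *recvbuf = malloc(sizeof(double) * PARTITIONS * COUNT);

    /* One partitioned send to the right neighbour, one partitioned
     * receive from the left neighbour (one direction of the ring). */
    MPI_Request reqs[2];
    MPI_Psend_init(sendbuf, PARTITIONS, COUNT, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &reqs[0]);
    MPI_Precv_init(recvbuf, PARTITIONS, COUNT, MPI_DOUBLE, left, 0,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &reqs[1]);

    MPI_Startall(2, reqs);

    /* The partitions are distributed among the OpenMP threads; every
     * thread calls MPI_Pready and polls MPI_Parrived concurrently,
     * which is the pattern that exposed the deadlocks. */
    #pragma omp parallel for
    for (int p = 0; p < PARTITIONS; ++p) {
        /* ... fill sendbuf for partition p ... */
        MPI_Pready(p, reqs[0]);

        int arrived = 0;
        while (!arrived) {
            MPI_Parrived(reqs[1], p, &arrived);
        }
        /* ... consume recvbuf for partition p ... */
    }

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```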