Commit a10f6af

osc/rdma: fix some threading bugs

There were two bugs in osc/rdma when using threads:

 - Deadlock in ompi_osc_rdma_start_atomic. This occurs because ompi_osc_rdma_frag_alloc is called with the module lock held. To fix the issue the module lock is now recursive. In the future I will add a new lock to protect just the current rdma fragment.

 - Do not drop the lock in ompi_osc_rdma_frag_alloc when calling ompi_osc_rdma_frag_complete. Not only is it unnecessary, but dropping the lock at this point can let a competing thread corrupt the state.

(cherry picked from commit open-mpi/ompi@9ef0821)

Signed-off-by: Nathan Hjelm <[email protected]>
1 parent bbe134d commit a10f6af

2 files changed, 1 addition and 3 deletions

ompi/mca/osc/rdma/osc_rdma_component.c

Lines changed: 1 addition & 1 deletion

@@ -998,7 +998,7 @@ static int ompi_osc_rdma_component_select (struct ompi_win_t *win, void **base,
     }

     /* initialize the objects, so that always free in cleanup */
-    OBJ_CONSTRUCT(&module->lock, opal_mutex_t);
+    OBJ_CONSTRUCT(&module->lock, opal_recursive_mutex_t);
     OBJ_CONSTRUCT(&module->outstanding_locks, opal_hash_table_t);
     OBJ_CONSTRUCT(&module->pending_posts, opal_list_t);
     OBJ_CONSTRUCT(&module->peer_lock, opal_mutex_t);

ompi/mca/osc/rdma/osc_rdma_frag.h

Lines changed: 0 additions & 2 deletions

@@ -73,9 +73,7 @@ static inline int ompi_osc_rdma_frag_alloc (ompi_osc_rdma_module_t *module, size
     module->rdma_frag = NULL;

     if (curr) {
-        OPAL_THREAD_UNLOCK(&module->lock);
         ompi_osc_rdma_frag_complete (curr);
-        OPAL_THREAD_LOCK(&module->lock);
     }

     item = opal_free_list_get (&mca_osc_rdma_component.frags);
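
The first fix in the commit message (making the module lock recursive) can be illustrated with a small standalone sketch. It uses plain POSIX threads rather than Open MPI's OPAL mutex types, and every name in it (`demo_module_t`, `demo_module_init`, `demo_frag_alloc`, `demo_start_atomic`) is hypothetical and chosen only for illustration. It assumes, as the commit message implies, that an outer routine holds the module lock while calling a helper that takes the same lock again. An error-checking mutex stands in for the non-recursive case so that the self-deadlock is reported as `EDEADLK` instead of hanging.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the osc/rdma module; not Open MPI's real struct. */
typedef struct {
    pthread_mutex_t lock;
    int frags_allocated;
} demo_module_t;

/* Initialize the module lock as recursive or as error-checking (non-recursive). */
static int demo_module_init(demo_module_t *m, int recursive)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (0 != rc) {
        return rc;
    }

    rc = pthread_mutexattr_settype(&attr, recursive ? PTHREAD_MUTEX_RECURSIVE
                                                    : PTHREAD_MUTEX_ERRORCHECK);
    if (0 == rc) {
        rc = pthread_mutex_init(&m->lock, &attr);
    }

    pthread_mutexattr_destroy(&attr);
    m->frags_allocated = 0;
    return rc;
}

/* Helper that takes the module lock itself (assumed analogue of a frag alloc). */
static int demo_frag_alloc(demo_module_t *m)
{
    int rc = pthread_mutex_lock(&m->lock);
    if (0 != rc) {
        /* With an error-checking mutex this is EDEADLK when the caller
         * already holds the lock; a default mutex could hang here. */
        return rc;
    }

    m->frags_allocated++;
    pthread_mutex_unlock(&m->lock);
    return 0;
}

/* Outer routine that calls the helper while holding the module lock. */
static int demo_start_atomic(demo_module_t *m)
{
    int rc = pthread_mutex_lock(&m->lock);
    if (0 != rc) {
        return rc;
    }

    rc = demo_frag_alloc(m);
    pthread_mutex_unlock(&m->lock);
    return rc;
}
```

Calling `demo_start_atomic` on a module initialized with `recursive == 0` returns `EDEADLK` and leaves `frags_allocated` at 0, while the same call on a module initialized with `recursive != 0` succeeds and increments the counter. This only mirrors the locking pattern; it says nothing about how `opal_recursive_mutex_t` is implemented.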
