Commit 63a0a7c

osc/rdma: make locking code more robust
Under heavy load the locking code could fail if the underlying btl module started to return OPAL_ERR_OUT_OF_RESOURCE on atomic operations. This commit updates the code to gracefully handle btl errors.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 4707c7c)
Signed-off-by: Nathan Hjelm <[email protected]>
1 parent 97e48bf commit 63a0a7c

2 files changed: +163 -141 lines changed

ompi/mca/osc/rdma/osc_rdma.h

Lines changed: 9 additions & 1 deletion
@@ -8,7 +8,7 @@
  * University of Stuttgart. All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
- * Copyright (c) 2007-2016 Los Alamos National Security, LLC. All rights
+ * Copyright (c) 2007-2017 Los Alamos National Security, LLC. All rights
  * reserved.
  * Copyright (c) 2010 Cisco Systems, Inc. All rights reserved.
  * Copyright (c) 2012-2013 Sandia National Laboratories. All rights reserved.
@@ -511,4 +511,12 @@ static inline void ompi_osc_rdma_aggregation_return (ompi_osc_rdma_aggregation_t
     opal_free_list_return(&mca_osc_rdma_component.aggregate, (opal_free_list_item_t *) aggregation);
 }
 
+
+__opal_attribute_always_inline__
+static bool ompi_osc_rdma_oor (int rc)
+{
+    /* check for OPAL_SUCCESS first to short-circuit the statement in the common case */
+    return (OPAL_SUCCESS != rc && (OPAL_ERR_OUT_OF_RESOURCE == rc || OPAL_ERR_TEMP_OUT_OF_RESOURCE == rc));
+}
+
 #endif /* OMPI_OSC_RDMA_H */
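
The second changed file is not reproduced on this page, so the following is only a rough sketch of how a predicate like `ompi_osc_rdma_oor` might be used to retry an operation that reports a transient out-of-resource condition. It is not the code from this commit. The numeric error values are placeholders so the sketch compiles on its own, and `sketch_retry_on_oor`, `sketch_op_fn_t`, `sketch_fake_btl_t`, and `sketch_fake_atomic` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder error codes so the sketch builds standalone; real code would
 * take these from Open MPI's OPAL headers instead. */
enum {
    OPAL_SUCCESS = 0,
    OPAL_ERROR = -1,
    OPAL_ERR_OUT_OF_RESOURCE = -2,
    OPAL_ERR_TEMP_OUT_OF_RESOURCE = -3
};

/* Same predicate as the helper added in the diff above. */
static inline bool ompi_osc_rdma_oor (int rc)
{
    /* check for OPAL_SUCCESS first to short-circuit the common case */
    return (OPAL_SUCCESS != rc &&
            (OPAL_ERR_OUT_OF_RESOURCE == rc || OPAL_ERR_TEMP_OUT_OF_RESOURCE == rc));
}

/* Hypothetical signature for "some operation that may run out of resources". */
typedef int (*sketch_op_fn_t) (void *ctx);

/* Hypothetical wrapper: call op until it returns something other than an
 * out-of-resource code, or until max_attempts calls have been made.
 * Any other error is returned to the caller immediately. */
static int sketch_retry_on_oor (sketch_op_fn_t op, void *ctx, int max_attempts)
{
    int rc = OPAL_ERR_OUT_OF_RESOURCE;

    assert (NULL != op);

    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = op (ctx);
        if (!ompi_osc_rdma_oor (rc)) {
            break;
        }
        /* a real implementation would likely drive communication progress
         * here so that resources can be released before the next attempt */
    }

    return rc;
}

/* Hypothetical stand-in for a btl atomic operation: reports out-of-resource
 * a fixed number of times, then succeeds. */
typedef struct {
    int failures_left;
    int calls;
} sketch_fake_btl_t;

static int sketch_fake_atomic (void *ctx)
{
    sketch_fake_btl_t *btl = (sketch_fake_btl_t *) ctx;

    btl->calls++;
    if (btl->failures_left > 0) {
        btl->failures_left--;
        return OPAL_ERR_OUT_OF_RESOURCE;
    }

    return OPAL_SUCCESS;
}
```

Treating both out-of-resource codes as "try again" while passing every other error through keeps genuine failures visible to the caller. The bounded attempt count is a choice made for this sketch only.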
