
Commit 5f5a650

Hakon-Bugge authored and jgunthorpe committed
RDMA/core/sa_query: Retry SA queries
A MAD packet is sent as an unreliable datagram (UD). SA requests are sent as MAD packets. As such, SA requests or responses may be silently dropped.

IB Core's MAD layer has a timeout and retry mechanism, which, among others, is used by RDMA CM. But it is not used by SA queries. Without retries, an SA query that loses a packet waits out the full specified timeout and then returns an error. The ULP or user-land process has to perform the retry.

Fix this by taking advantage of the MAD layer's retry mechanism.

First, a check against a zero timeout is added in rdma_resolve_route(). In send_mad(), we set the MAD layer timeout to one tenth of the specified timeout and the number of retries to 10. When the specified timeout is less than 10 ms, the MAD layer timeout is set to 1 ms and the number of retries to the specified timeout.

With this fix:

 # ucmatose -c 1000 -S 1024 -C 1

runs stably on an InfiniBand fabric. Without this fix, we see intermittent failures and it errors out with:

cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110

(110 is ETIMEDOUT)

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Håkon Bugge <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
1 parent f0a6419 commit 5f5a650

2 files changed: 11 additions and 1 deletion

drivers/infiniband/core/cma.c

Lines changed: 3 additions & 0 deletions
@@ -3117,6 +3117,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, unsigned long timeout_ms)
 	struct rdma_id_private *id_priv;
 	int ret;
 
+	if (!timeout_ms)
+		return -EINVAL;
+
 	id_priv = container_of(id, struct rdma_id_private, id);
 	if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_RESOLVED, RDMA_CM_ROUTE_QUERY))
 		return -EINVAL;

drivers/infiniband/core/sa_query.c

Lines changed: 8 additions & 1 deletion
@@ -1304,14 +1304,21 @@ static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 {
 	unsigned long flags;
 	int ret, id;
+	const int nmbr_sa_query_retries = 10;
 
 	xa_lock_irqsave(&queries, flags);
 	ret = __xa_alloc(&queries, &id, query, xa_limit_32b, gfp_mask);
 	xa_unlock_irqrestore(&queries, flags);
 	if (ret < 0)
 		return ret;
 
-	query->mad_buf->timeout_ms = timeout_ms;
+	query->mad_buf->timeout_ms = timeout_ms / nmbr_sa_query_retries;
+	query->mad_buf->retries = nmbr_sa_query_retries;
+	if (!query->mad_buf->timeout_ms) {
+		/* Special case, very small timeout_ms */
+		query->mad_buf->timeout_ms = 1;
+		query->mad_buf->retries = timeout_ms;
+	}
 	query->mad_buf->context[0] = query;
 	query->id = id;
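
The timeout split in send_mad() can be tried out in user space. The following is a minimal sketch, not kernel code: struct sa_retry_plan and plan_sa_retries() are hypothetical names introduced only for illustration, and the arithmetic is copied from the hunk above.

```c
/* Hypothetical user-space model of the timeout split in send_mad().
 * The names below are illustrative and do not exist in the kernel. */
struct sa_retry_plan {
	unsigned long timeout_ms;	/* per-attempt timeout handed to the MAD layer */
	int retries;			/* number of retries handed to the MAD layer */
};

static struct sa_retry_plan plan_sa_retries(unsigned long timeout_ms)
{
	const int nmbr_sa_query_retries = 10;
	struct sa_retry_plan plan;

	/* Normal case: one tenth of the caller's timeout per attempt. */
	plan.timeout_ms = timeout_ms / nmbr_sa_query_retries;
	plan.retries = nmbr_sa_query_retries;

	if (!plan.timeout_ms) {
		/* Caller's timeout was below 10 ms: integer division gave 0,
		 * so use 1 ms per attempt and timeout_ms retries instead. */
		plan.timeout_ms = 1;
		plan.retries = (int)timeout_ms;
	}
	return plan;
}
```

Under this sketch, a caller timeout of 4000 ms gives 400 ms per attempt with 10 retries, and a caller timeout of 5 ms gives 1 ms per attempt with 5 retries.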
