@Nasf-Fan Nasf-Fan commented Jan 4, 2026

Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only 3 seconds. This helps the query get going more quickly when some rank(s) are down, but it increases the risk that the query fails with a timeout when there are only a few targets in the system and they are busy or not ready in time when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the rank that previously reported an RPC timeout. The retry uses the default RPC timeout configuration instead of the small initial value.
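
The intended behavior can be illustrated with a minimal, synchronous sketch. It is not the CaRT implementation: `proto_query_with_retry`, `query_fn`, `slow_rank_query`, and the 60-second `DEFAULT_TIMEOUT_SEC` are hypothetical names and values chosen for illustration. Only the 3-second initial timeout and the -1011 error code come from this PR and its ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names and values for illustration; not the CaRT API. */
#define INITIAL_TIMEOUT_SEC 3   /* short first-pass timeout from the description */
#define DEFAULT_TIMEOUT_SEC 60  /* assumed stand-in for the default RPC timeout */
#define ERR_TIMEDOUT (-1011)    /* DER_TIMEDOUT value quoted in the ticket title */
#define NO_RANK (-1)

/* Sends one query to a rank; returns 0 on success or a negative error. */
typedef int (*query_fn)(int rank, int timeout_sec, void *arg);

/*
 * Query each rank with the short timeout until one answers. If none answers
 * and at least one rank timed out, retry the first such rank once with the
 * default timeout.
 */
static int
proto_query_with_retry(const int *ranks, size_t nranks, query_fn query, void *arg)
{
	int    first_timeout_rank = NO_RANK;
	int    rc = ERR_TIMEDOUT;
	size_t i;

	assert(ranks != NULL && nranks > 0 && query != NULL);

	for (i = 0; i < nranks; i++) {
		rc = query(ranks[i], INITIAL_TIMEOUT_SEC, arg);
		if (rc == 0)
			return 0;
		if (rc == ERR_TIMEDOUT && first_timeout_rank == NO_RANK)
			first_timeout_rank = ranks[i];
	}

	/* One extra attempt, this time with the default timeout. */
	if (first_timeout_rank != NO_RANK)
		rc = query(first_timeout_rank, DEFAULT_TIMEOUT_SEC, arg);

	return rc;
}

/* Simulated busy rank: answers only if given at least *arg seconds. */
static int
slow_rank_query(int rank, int timeout_sec, void *arg)
{
	int needed_sec = *(int *)arg;

	(void)rank;
	return timeout_sec >= needed_sec ? 0 : ERR_TIMEDOUT;
}
```

With a simulated rank that needs 10 seconds to respond, every 3-second attempt times out, and the final retry with the assumed default timeout succeeds.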

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Jan 4, 2026

Ticket title is 'daos_rpc_proto_query() crt_proto_query()failed: DER_TIMEDOUT(-1011): 'Time out''
Status is 'In Review'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-18388

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/451/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/466/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch from b408c13 to 0863dd6 Compare January 5, 2026 06:39
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/5/execution/node/1176/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch 2 times, most recently from bd43c12 to bc0c86f Compare January 6, 2026 05:25
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 7, 2026 03:30
@Nasf-Fan Nasf-Fan requested review from a team as code owners January 7, 2026 03:30
knard38 previously approved these changes Jan 7, 2026
Contributor

@knard38 knard38 left a comment

LGTM


/* More retry to the first timeout rank with default timeout. */
rank = rproto->first_timeout_rank;
rproto->timeout = 0;
Contributor

Shouldn't this be rproto->timeout = timeout; in this case? The timeout queried on line 137 will be the 'default timeout' that you mention on line 151.

Contributor Author

We use rproto->timeout as a flag to indicate that we have already retried, as checked at L142. The related CaRT-level logic will automatically set the new RPC's timeout to the default timeout.

Contributor

I'm not sure why you need to treat 0 here as a special value.

If you set rproto->timeout = timeout, the check on line 142 will still trigger on the next iteration (the (timeout > 0 && timeout <= rproto->timeout) part).

My concern is that setting it to 0 can lead to issues if someone later decides to add, for example, '+3': instead of 'default timeout'+3, you end up with a timeout of just 3 seconds.
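
The arithmetic behind this concern can be shown with a small sketch. Everything in it is assumed for illustration: `effective_timeout` stands in for a lower layer that maps 0 to the default, `extend_timeout` stands in for a hypothetical later change, and the 60-second default is a made-up value.

```c
#include <assert.h>

#define DEFAULT_TIMEOUT_SEC 60 /* assumed value, for illustration only */

/* Hypothetical lower layer: treats 0 as "use the default timeout". */
static int
effective_timeout(int requested_sec)
{
	assert(requested_sec >= 0);
	return requested_sec == 0 ? DEFAULT_TIMEOUT_SEC : requested_sec;
}

/* Hypothetical later change: add extra seconds to whatever value is stored. */
static int
extend_timeout(int stored_sec, int extra_sec)
{
	return effective_timeout(stored_sec + extra_sec);
}
```

Under these assumptions, `extend_timeout(0, 3)` yields 3 seconds because the sentinel is lost once anything is added to it, whereas `extend_timeout(DEFAULT_TIMEOUT_SEC, 3)` yields 63 seconds as intended.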

Contributor Author

OK, I see your concern. I will refresh the patch.

@Nasf-Fan Nasf-Fan requested a review from frostedcmos January 9, 2026 03:04
@Nasf-Fan
Contributor Author

Nasf-Fan commented Jan 9, 2026

Ping reviewers, thanks!

@Nasf-Fan Nasf-Fan requested a review from jolivier23 January 9, 2026 05:33
@frostedcmos frostedcmos requested a review from mchaarawi January 9, 2026 17:58
Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only
3 seconds. This helps the query get going more quickly when some
rank(s) are down, but it increases the risk that the query fails with
a timeout when there are only a few targets in the system and they
are busy or not ready in time when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the
rank that previously reported an RPC timeout. The retry uses the
default RPC timeout configuration instead of the small initial value.

Signed-off-by: Fan Yong <[email protected]>