-
Notifications
You must be signed in to change notification settings - Fork 338
DAOS-18388 client: handle timeout for CRT_OPC_PROTO_QUERY RPC #17335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Ticket title is 'daos_rpc_proto_query() crt_proto_query()failed: DER_TIMEDOUT(-1011): 'Time out'' |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/451/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/466/log |
b408c13 to
0863dd6
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17335/5/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/5/execution/node/1176/log |
bd43c12 to
bc0c86f
Compare
knard38
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/client/api/rpc.c
Outdated
|
|
||
| /* More retry to the first timeout rank with default timeout. */ | ||
| rank = rproto->first_timeout_rank; | ||
| rproto->timeout = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be rproto->timeout = timeout; in this case? the timeout queried from line 137 will be the 'default timeout' that you mention on line 151
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use rproto->timeout as a flag to indicate we have retried as L142. Related cart level logic will automatically set the new RPC timeout as the default timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why you need to treat 0 here as a special value.
if you set rproto->timeout = timeout, the logic on line 142 will still trigger on a next iteration ((timeout > 0 && timeout <= rproto->timeout)) part.
My concern is that setting it to 0 can lead to issues if someone later decided to do for example '+3', and instead of 'default timeout'+3 you now end up with a timeout of 3 seconds now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see your concern. I will refresh the patch.
|
Ping reviewers, thanks! |
Currently, the initial timeout for CRT_OPC_PROTO_QUERY RPC is only 3 seconds, it will help to get going more quickly when some rank(s) is down. But that increases the risk of query failure with timeout if there are only a few targets in the system and they may be busy or not ready in time when being queried. The patch adds another one CRT_OPC_PROTO_QUERY RPC retry against the rank that has ever reported RPC timeout. Such retry will use default RPC timeout configuration instead of initial small value. Signed-off-by: Fan Yong <[email protected]>
bc0c86f to
379e8ae
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17335/10/testReport/ |
Currently, the initial timeout for CRT_OPC_PROTO_QUERY RPC is only 3 seconds, it will help to get going more quickly when some rank(s) is down. But that increases the risk of query failure with timeout if there are only a few targets in the system and they may be busy or not ready in time when being queried.
The patch adds another one CRT_OPC_PROTO_QUERY RPC retry against the rank that has ever reported RPC timeout. Such retry will use default RPC timeout configuration instead of initial small value.
Steps for the author:
After all prior steps are complete: