YARN-11844: Support configuration of retry policy on GPU discovery #7857
Conversation
🎊 +1 overall
This message was automatically generated.
Closes apache#7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]>
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
final String msg = "Failed to execute GPU device detection script";
// The default max errors is 10. Verify that it keeps going for an 11th try.
for (int i = 0; i < 11; ++i) {
I changed this 11 to 15 and the test still doesn't fail for me; can you check once?
This test is covering the case where you disable the max errors by setting a negative value. To make this clearer, I dialed it up to 20 attempts, and I also added another test that sets the configuration to 11 and confirms it tries exactly 11 times.
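The semantics being tested can be sketched as follows. This is a hypothetical model of the retry cap, not the actual NodeManager code; the class and method names are invented for illustration. A negative max-errors value disables the cap entirely, while a non-negative value allows exactly that many failed attempts:

```java
// Hypothetical sketch of the retry cap under discussion (not the real
// NodeManager implementation): a negative configured max-errors value
// means "no limit", so discovery keeps being attempted on failure.
public class GpuDiscoveryRetrySketch {
    private final int maxErrors; // e.g. from the discovery-max-errors property
    private int errorCount = 0;

    GpuDiscoveryRetrySketch(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    /** Returns true if another discovery attempt is allowed. */
    boolean shouldAttempt() {
        // A negative configured value disables the error cap.
        return maxErrors < 0 || errorCount < maxErrors;
    }

    void recordError() {
        errorCount++;
    }

    public static void main(String[] args) {
        // With maxErrors = 11, exactly 11 failed attempts are allowed.
        GpuDiscoveryRetrySketch capped = new GpuDiscoveryRetrySketch(11);
        int attempts = 0;
        while (capped.shouldAttempt() && attempts < 100) {
            attempts++;
            capped.recordError();
        }
        System.out.println("capped attempts: " + attempts); // prints 11

        // With a negative value, only the test's own safety bound stops it.
        GpuDiscoveryRetrySketch uncapped = new GpuDiscoveryRetrySketch(-1);
        attempts = 0;
        while (uncapped.shouldAttempt() && attempts < 20) {
            attempts++;
            uncapped.recordError();
        }
        System.out.println("uncapped attempts: " + attempts); // prints 20
    }
}
```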
yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
</description>
<name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
<value>10000ms</value>
Any reason for not using 10s?
Sounds good, updated.
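The reason `10s` works as well as `10000ms` is that YARN duration properties accept time-unit suffixes, which Hadoop resolves via `Configuration.getTimeDuration`. A minimal standalone sketch of that kind of suffix handling (the helper below is illustrative only, not the actual Hadoop parsing code):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch: "10s" and "10000ms" denote the same timeout once
// unit suffixes are resolved. This mimics, but does not reproduce, the
// parsing done by Hadoop's Configuration.getTimeDuration.
public class DurationParseSketch {
    static long parseMillis(String value) {
        String v = value.trim();
        if (v.endsWith("ms")) {
            return Long.parseLong(v.substring(0, v.length() - 2));
        } else if (v.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(
                Long.parseLong(v.substring(0, v.length() - 1)));
        }
        // Bare number: assume milliseconds for this sketch.
        return Long.parseLong(v);
    }

    public static void main(String[] args) {
        System.out.println(parseMillis("10000ms")); // prints 10000
        System.out.println(parseMillis("10s"));     // prints 10000
    }
}
```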
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
@ayushtkn , thanks for the review! All comments addressed, and +1 from Yetus. How does this look now?
LGTM
Closes #7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]> Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit 0f34922)
Closes #7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]> Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit 0f34922) (cherry picked from commit f2b69ba)
I committed to trunk and merged to branch-3.4 and branch-3.3 with some minor conflicts resolved. @ayushtkn and @hotcodemacha , thank you for the reviews.
Description of PR
The NodeManager invokes an external binary (e.g. `nvidia-smi`) to discover attached GPUs. Currently, there is a hard-coded 10-second timeout on execution of this binary and a hard-coded max error count of 10, beyond which the NodeManager stops attempting discovery. This change provides new configuration properties to control both the timeout and the max error count, which is useful in environments where there may be a delay in binding the GPU to the host. Defaults for the new properties are chosen to maintain the existing behavior.

Special thanks to @jjayadeep06 for co-authoring this patch.
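Based on the property names shown in the diff above, the new settings might be configured in yarn-site.xml roughly like this. This is a sketch: the default values shown (10s timeout, 10 max errors) come from the existing hard-coded behavior described above.

```xml
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
  <value>10s</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.discovery-max-errors</name>
  <!-- A negative value disables the error cap entirely. -->
  <value>10</value>
</property>
```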
How was this patch tested?
New unit test and manual testing.
For code changes:
If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?