YARN-11843. Fix potential deadlock when auto-correction of container allocation is enabled #7855
base: trunk
Conversation
@shameersss1 @slfan1989 Could you please help to review this PR? Thanks!
LGTM +1
...main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
@TaoYang526 - It is a good catch indeed.
@shameersss1 Currently, we don't have a specific use case for the auto-correction feature. We have always been aware that YARN has an over-allocation issue, but considered it a minor concern, so no action was taken. When I discovered YARN-11702, I wanted to take a look and found that it could resolve most over-allocation issues, except for some in-scheduling cases. That is sufficient to reduce unnecessary consumption in the scheduling cycle, and we may enable this feature in the future.
Thanks for the details.
    - completedContainer(rmContainer, SchedulerUtils.createAbnormalContainerStatus(
    -     rmContainer.getContainerId(), SchedulerUtils.EXPIRED_CONTAINER),
    -     RMContainerEventType.EXPIRE);
    + asyncContainerRelease(rmContainer);
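For context, here is a minimal, self-contained sketch of the idea behind this change. All names below are hypothetical, not the actual YARN classes, and it assumes the deadlock arises from completing the container inline while the auto-correction path already holds an application lock that another thread, holding the scheduler lock, is waiting on. Queueing the release for a dispatcher thread breaks that lock-ordering cycle:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative sketch, NOT the actual YARN code.
    public class AsyncReleaseSketch {
      private final ReentrantLock schedulerLock = new ReentrantLock();
      private final BlockingQueue<String> releaseQueue = new LinkedBlockingQueue<>();

      // Runs on the allocate path with the application lock already held.
      void autoCorrectOverAllocation(String containerId) {
        // Old shape: complete the container right here, which needs
        // schedulerLock and can deadlock against a thread that holds
        // schedulerLock and wants the application lock we hold.
        // New shape: only enqueue; no further lock is taken.
        releaseQueue.offer(containerId);
      }

      // A dispatcher thread drains the queue with no application lock
      // held, so it can take schedulerLock safely.
      void dispatchLoop() throws InterruptedException {
        while (true) {
          String containerId = releaseQueue.take();
          schedulerLock.lock();
          try {
            System.out.println("released " + containerId);
          } finally {
            schedulerLock.unlock();
          }
        }
      }
    }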
Will this cause the message at the end of the container's lifecycle to be different?
Yes, the message will be updated from "Container expired since it was unused" to "Container released by application", which I believe is more appropriate.
> Yes, the message will be updated from "Container expired since it was unused" to "Container released by application", which I believe is more appropriate.

No, I think the old one is clearer.
For the old one, "expired since it was unused" is not accurate, because the container was released due to over-allocation, not because a waiting period expired. As for the new one, "Container released by application" is already used in various scenarios where the application actively releases a container, and I think over-allocation can be included in that scope. @shameersss1 could you please share your thoughts about this?
@TaoYang526 +1
Container released makes more sense here.
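For reference, a small sketch of the two diagnostics under discussion, assuming the SchedulerUtils constants carry the strings quoted above (the class name here is illustrative):

    import org.apache.hadoop.yarn.api.records.ContainerId;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils;

    class DiagnosticsSketch {
      // Before the change, the auto-corrected container carried the expiry text.
      static ContainerStatus oldStatus(ContainerId id) {
        return SchedulerUtils.createAbnormalContainerStatus(
            id, SchedulerUtils.EXPIRED_CONTAINER);   // "Container expired since it was unused"
      }

      // After the change, the release path reports the release text instead.
      static ContainerStatus newStatus(ContainerId id) {
        return SchedulerUtils.createAbnormalContainerStatus(
            id, SchedulerUtils.RELEASED_CONTAINER);  // "Container released by application"
      }
    }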
@slfan1989 @aajisaka Could you please help review this PR when you have time?
Updated the expected state in AbstractYarnScheduler#recoverResourceRequestForContainer and added a UT for it.
Description of PR
For details, please refer to YARN-11843.