admission: move yielding into ElasticCPUWorkHandle #159917
Conversation
    if !ok {
        tenantID = roachpb.SystemTenantID
    }
    return db.AdmissionPacerFactory.NewPacer(
I think AdmissionPacerFactory can be nil as well: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/tenant.go#L222-L223
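For illustration, a defensive guard for that nil case might look like the sketch below. The interface shapes are simplified stand-ins, not the exact admission package API; the point is only that a nil factory must short-circuit before NewPacer is called.

```go
package admission

// Pacer and PacerFactory are simplified stand-ins for the real admission
// types; only the nil-handling pattern matters here.
type Pacer interface {
	Pace()
}

type PacerFactory interface {
	NewPacer() Pacer
}

// newPacerOrNil returns nil when no factory is configured, which can happen
// in SQL pods where elastic admission control is not hooked up (see
// pkg/server/tenant.go). Callers must treat a nil Pacer as a no-op.
func newPacerOrNil(f PacerFactory) Pacer {
	if f == nil {
		return nil
	}
	return f.NewPacer()
}
```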
Good catch.
Is this a concern? That is, were you trying to make yield work for SQL pods in serverless?
If yes, I'll look into fixing that old TODO.
I guess my version would yield in pods even if the rest of elastic AC was otherwise not hooked up. But I don't know if this is all that important, since perhaps the right answer is just to aim to hook up a real elastic granter in all SQL servers, including pods, and then be able to assume it is never nil?
But I don't think that needs to happen here.
To clarify: I’m 👍 merging as is.
Ack. I added a commit that creates a real elastic grant coordinator.
I've dropped this commit since I suspect it was the cause of some test failures that were hard to track down. I'll revive it later.
Force-pushed from b8d95e8 to fbef5ab.
sumeerbhola left a comment
TFTR!
@sumeerbhola made 3 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @golgeek, @tbg, and @williamchoe3).
Potential Bug(s) Detected
The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.
Next Steps:
Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
Force-pushed from fbef5ab to 4ba4340.
Force-pushed from 6123dc8 to 95628b7.
Won't be able to review this until I'm back from PTO; feel free to merge this with @dt's LGTM.
ElasticCPUWorkHandle.Overlimit is expected to be called in a tight loop, so yielding there is conceptually the right place. More importantly, this will allow, in the future, KV work that is not holding latches to also yield.

As part of this change, elastic work that does not wish to wait in admission control queues (due to cluster settings) is now accounted for in the elastic tokens and in the admission.elastic_cpu_bypassed.utilization metric. One side effect of this accounting is that work that needs to wait in admission queues may have fewer tokens available to it, and may wait longer. This is considered acceptable since:
- Elastic work that bypasses queueing is still elastic work, and our overarching goal is to reduce impact to foreground work.
- Due to the default-on use of runtime.Yield, all elastic work yields, which allows the system to run at higher elastic CPU utilization without impacting the latency of foreground work.

Epic: none
Release note: None
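As a rough sketch of the pattern this commit message describes (simplified names, not the actual CockroachDB implementation; runtime.Gosched stands in for the runtime.Yield mentioned above):

```go
package admission

import (
	"runtime"
	"time"
)

// elasticCPUWorkHandle is a simplified stand-in for ElasticCPUWorkHandle.
type elasticCPUWorkHandle struct {
	allotted time.Duration
	used     func() time.Duration // CPU time consumed so far
}

// overLimit is meant to be called from a tight loop. With yielding moved in
// here, every elastic caller yields when over its allotment without wiring
// it up itself; in the future, KV work not holding latches can reuse the
// same path.
func (h *elasticCPUWorkHandle) overLimit() (bool, time.Duration) {
	diff := h.used() - h.allotted
	if diff > 0 {
		runtime.Gosched() // stand-in for runtime.Yield
		return true, diff
	}
	return false, diff
}
```

A caller's loop then stays trivial: do a unit of work, check overLimit, and on true stop to refill tokens, having already yielded the CPU.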
Force-pushed from 95628b7 to d627a86.
sumeerbhola left a comment
Is this because this is rearranging existing functionality? Or is this PR just very light on testing?
Both, in that it is rearrangement, and AFAIK the yield stuff does not have any existing automated testing (I'll rerun @dt's test https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1767716152730549?thread_ts=1766160955.465809&cid=C01SRKWGHG8).
@sumeerbhola made 4 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @dt, @golgeek, @kyle-a-wong, @tbg, @williamchoe3, and @xinhaoz).
pkg/util/admission/elastic_cpu_work_handle.go line 118 at r4 (raw file):
    // TODO(irfansharif): Non-test callers use one or the other return value, not
    // both. Split this API?
    func (h *ElasticCPUWorkHandle) IsOverLimitAndPossiblyYield() (
Changed the name here.
pkg/util/admission/elastic_cpu_work_queue.go line 108 at r4 (raw file):
    e.metrics.PreWorkNanos.Inc(h.preWork.Nanoseconds())
    _, difference := h.overLimitInner()
Not using overLimitInner was a buglet even before this PR, in that it could return stale information if enough iterations hadn't happened. With this PR, we definitely don't want to yield here.
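To illustrate the buglet (assumed mechanics, sketched for exposition; the real handle differs): the tight-loop check only samples CPU time every so many calls to stay cheap, so its cached answer can lag, while the inner variant recomputes unconditionally and never yields.

```go
package admission

import "time"

const sampleInterval = 128 // hypothetical sampling cadence

type handle struct {
	calls    int
	allotted time.Duration
	cached   time.Duration        // difference as of the last sample
	used     func() time.Duration // actual CPU time consumed
}

// overLimit is cheap, but between samples it can report stale information,
// which is the buglet described above when too few iterations have happened.
func (h *handle) overLimit() (bool, time.Duration) {
	h.calls++
	if h.calls%sampleInterval == 0 {
		h.cached = h.used() - h.allotted
	}
	return h.cached > 0, h.cached
}

// overLimitInner always recomputes the difference directly, and it must not
// yield: it is used for accounting (e.g. pre-work nanos), not pacing.
func (h *handle) overLimitInner() (bool, time.Duration) {
	diff := h.used() - h.allotted
	return diff > 0, diff
}
```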
Force-pushed from 67b6f6c to d627a86.
sumeerbhola left a comment
@sumeerbhola made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @dt, @golgeek, @kyle-a-wong, @tbg, @williamchoe3, and @xinhaoz).
Done and verified no behavior change.
bors r=dt