
Conversation

@sumeerbhola (Collaborator)

ElasticCPUWorkHandle.Overlimit is expected to be called in a tight loop, so yielding there is conceptually the right place. More importantly, this will allow KV work that is not holding latches to also yield in the future.
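
The yield-in-the-handle pattern described above can be illustrated with a minimal sketch. This is not the real pkg/util/admission API: the handle type, field names, and yield hook below are hypothetical stand-ins, with runtime.Gosched used where the real code would use a runtime yield primitive.

    // Hypothetical sketch only; not CockroachDB's ElasticCPUWorkHandle.
    package main

    import (
        "runtime"
        "time"
    )

    // workHandle stands in for an elastic CPU work handle that tracks how long
    // the work has been running against its allotment (wall time is used here
    // for simplicity; the real handle tracks CPU time).
    type workHandle struct {
        start    time.Time
        allotted time.Duration
    }

    // overLimitAndMaybeYield reports whether the work has exceeded its
    // allotment, yielding the processor on each call so that foreground
    // goroutines get a chance to run. It is meant to be called from a tight
    // per-item loop, which is why the check itself is cheap.
    func (h *workHandle) overLimitAndMaybeYield() bool {
        runtime.Gosched() // stand-in for the yield performed inside the handle
        return time.Since(h.start) > h.allotted
    }

    // processItems processes items until done, or until the handle reports
    // that the elastic CPU allotment is exhausted.
    func processItems(items []int, h *workHandle) int {
        done := 0
        for range items {
            // ... process one item ...
            done++
            if h.overLimitAndMaybeYield() {
                break // return to admission control to acquire more tokens
            }
        }
        return done
    }

    func main() {
        h := &workHandle{start: time.Now(), allotted: 10 * time.Millisecond}
        _ = processItems(make([]int, 1<<20), h)
    }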

As part of this change, elastic work that does not wish to wait in admission control queues (due to cluster settings) is now accounted for in the elastic tokens and in the admission.elastic_cpu_bypassed.utilization metric; a sketch of this accounting follows the list below. One side effect of this accounting is that work that does need to wait in admission queues may have fewer tokens available to it, and may wait longer. This is considered acceptable since:

  • Elastic work that bypasses queueing is still elastic work, and our overarching goal is to reduce the impact on foreground work.
  • Because runtime.Yield is on by default, all elastic work yields, which allows the system to run at higher elastic CPU utilization without impacting the latency of foreground work.
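
As a rough illustration of the accounting change above, here is a minimal sketch assuming a hypothetical granter; the names, fields, and structure are illustrative and do not match CockroachDB's admission package.

    // Hypothetical sketch of accounting bypassed elastic work against elastic
    // CPU tokens; not the actual granter implementation.
    type elasticGranter struct {
        availableNanos    int64 // elastic CPU tokens, in CPU-nanoseconds
        bypassedUsedNanos int64 // feeds a bypassed-utilization style metric
    }

    // admitBypassed accounts for elastic work that skips the admission queue:
    // it still deducts elastic tokens (so queued work may see fewer tokens and
    // wait longer) and is tracked separately so bypassed utilization can be
    // reported.
    func (g *elasticGranter) admitBypassed(allottedNanos int64) {
        g.availableNanos -= allottedNanos
        g.bypassedUsedNanos += allottedNanos
    }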

Epic: none

Release note: None

@sumeerbhola sumeerbhola requested review from dt and tbg December 19, 2025 19:53
@sumeerbhola sumeerbhola requested review from a team as code owners December 19, 2025 19:53
@sumeerbhola sumeerbhola requested review from golgeek and williamchoe3 and removed request for a team December 19, 2025 19:53
@cockroach-teamcity (Member)

This change is Reviewable

if !ok {
	tenantID = roachpb.SystemTenantID
}
return db.AdmissionPacerFactory.NewPacer(

@sumeerbhola (Collaborator, Author)

Good catch.
Is this a concern, i.e., were you trying to make yield work for SQL pods in serverless?
If yes, I'll look into fixing that old TODO.

(Contributor)

I guess my version would yield in pods even if the rest of elastic AC was otherwise not hooked up. But I don't know if this is all that important, since perhaps the right answer is just to aim to hook up a real elastic granter in all SQL servers (including pods) and then be able to assume it is never nil?

But I don't think that needs to happen here.

(Contributor)

To clarify: I’m 👍 merging as is.

@sumeerbhola (Collaborator, Author)

Ack. I added a commit that creates a real elastic grant coordinator.

@sumeerbhola (Collaborator, Author)

I've dropped this commit since I suspect it was the cause of some test failures that were hard to track down. I'll revive it later.

@sumeerbhola (Collaborator, Author) left a comment

TFTR!

@sumeerbhola made 3 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @golgeek, @tbg, and @williamchoe3).


@github-actions (bot)

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Dec 19, 2025
@sumeerbhola sumeerbhola added the O-AI-Review-Real-Issue-Found AI reviewer found real issue label Dec 19, 2025
@sumeerbhola sumeerbhola force-pushed the yield_in_handle branch 2 times, most recently from 6123dc8 to 95628b7 on December 20, 2025 00:15
@sumeerbhola sumeerbhola requested a review from a team as a code owner December 20, 2025 00:15
@sumeerbhola sumeerbhola requested review from DrewKimball and removed request for a team December 20, 2025 00:15
@tbg (Member) commented Dec 24, 2025

Won't be able to review this until I'm back from PTO, feel free to merge this with @dt's LGTM.
I see a large number of changes but no changes in tests. Is this because this is rearranging existing functionality? Or is this PR just very light on testing?

@sumeerbhola sumeerbhola requested a review from a team as a code owner January 6, 2026 16:19
@sumeerbhola sumeerbhola requested a review from xinhaoz January 6, 2026 16:19
@sumeerbhola sumeerbhola requested a review from a team as a code owner January 6, 2026 16:38
@sumeerbhola sumeerbhola requested review from kyle-a-wong and removed request for a team January 6, 2026 16:38
@sumeerbhola (Collaborator, Author) left a comment

Is this because this is rearranging existing functionality? Or is this PR just very light on testing?

Both, in that it is rearrangement, and AFAIK the yield stuff does not have any existing automated testing (I'll rerun @dt's test https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1767716152730549?thread_ts=1766160955.465809&cid=C01SRKWGHG8).

@sumeerbhola made 4 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @dt, @golgeek, @kyle-a-wong, @tbg, @williamchoe3, and @xinhaoz).


pkg/util/admission/elastic_cpu_work_handle.go line 118 at r4 (raw file):

// TODO(irfansharif): Non-test callers use one or the other return value, not
// both. Split this API?
func (h *ElasticCPUWorkHandle) IsOverLimitAndPossiblyYield() (

Changed the name here.


pkg/util/admission/elastic_cpu_work_queue.go line 108 at r4 (raw file):

	e.metrics.PreWorkNanos.Inc(h.preWork.Nanoseconds())
	_, difference := h.overLimitInner()

Not using overLimitInner was a buglet even before this PR, in that it could return stale information if enough iterations hadn't happened. With this PR, we definitely don't want to yield here.


@sumeerbhola sumeerbhola force-pushed the yield_in_handle branch 5 times, most recently from 67b6f6c to d627a86 on January 7, 2026 17:21
@sumeerbhola (Collaborator, Author) left a comment

@sumeerbhola made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @dt, @golgeek, @kyle-a-wong, @tbg, @williamchoe3, and @xinhaoz).


@sumeerbhola (Collaborator, Author)

bors r=dt

@craig bot (Contributor) commented Jan 7, 2026

@craig craig bot merged commit f3f12a6 into cockroachdb:master Jan 7, 2026
39 of 46 checks passed

Labels

  • o-AI-Review-Potential-Issue-Detected (AI reviewer found potential issue. Never assign manually; auto-applied by GH action only.)
  • O-AI-Review-Real-Issue-Found (AI reviewer found real issue)
  • target-release-26.2.0
