@msbutler msbutler commented Nov 25, 2025

This patch reconfigures the jobs/stress roachtest logic in a few ways:

  • now runs on a 20 node cluster in the weekly suite
  • lowers the base interval to 0.1, to simulate internal job query contention at
    a 200 node scale
  • refines the job control loop so that each iteration pauses 20% of running
    changefeeds, resumes 20% of paused changefeeds, and recreates up to 200
    canceled changefeeds.
  • fails the roachtest if a job stays unclaimed for more than 5 minutes.

Informs: #158976

Release note: none

@msbutler msbutler force-pushed the butler-job-stress-steroids branch 2 times, most recently from 10ef427 to 3006d24 November 25, 2025 14:38
@msbutler msbutler self-assigned this Nov 25, 2025
@msbutler msbutler changed the title from "exp" to "job stress test repro" Nov 25, 2025
@github-actions

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Nov 25, 2025
@msbutler msbutler force-pushed the butler-job-stress-steroids branch from 897a4bf to c2a36a1 January 7, 2026 19:21
@msbutler msbutler changed the title from "job stress test repro" to "roachtest: beef up jobs/stress roachtest" Jan 7, 2026
msbutler commented Jan 7, 2026

This test sufficiently stresses the job system: the job adoption rate decreases significantly but does not flatline.
[image]

We also see plenty of logs indicating claim query timeouts (added in #160084), for example:

logs/cockroach.butler-test-2-0019.ubuntu.2026-01-07T18_53_13Z.003497.log:E260107 19:01:44.473771 848 jobs/registry.go:992 ⋮ [T1,Vsystem,n19] 2091  error claiming jobs: could not query jobs table: claim-jobs: failed to read query result: query execution canceled

@msbutler msbutler force-pushed the butler-job-stress-steroids branch from c2a36a1 to 90b6904 January 7, 2026 19:52
@msbutler msbutler marked this pull request as ready for review January 7, 2026 20:46
@msbutler msbutler requested a review from dt January 7, 2026 20:47