sql: introduce span-level checks and cluster-level checks #159319
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from fb47508 to f45cca8.
spilchen
left a comment
@spilchen made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/sql/inspect/inspect_processor.go line 212 at r1 (raw file):
```go
// Run any post-checks after all spans have been processed.
if err := p.processPostChecks(ctx); err != nil {
```
The handling has to happen at the coordinator. inspect_processor.go runs on each node, so a per-node processor never has a cluster-wide view and can't do cluster-level row count validation; that has to come from the coordinator.
The idea was to piggyback on the existing check (index_consistency_check.go). That check already gathers row counts for the hash step (see buildIndexHashQuery). Right now those counts are thrown away after span processing. Instead, we should accumulate them and send them back as part of the progress stream. That means adding some state to InspectProcessorProgress.
We can do this unconditionally whenever index_consistency_check.go runs. The coordinator then decides what to do: if row count validation is enabled, it processes the aggregated counts; if not, it just ignores them. We can probably still introduce a new type of check for this (clusterChecks?) to keep the code clean.
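To make the piggyback idea concrete, here is a minimal self-contained sketch. The field and type names (RowCountsByIndex, coordinator, onProgress, finish) are hypothetical stand-ins, not the actual jobspb or inspect APIs:

```go
package main

import "fmt"

// Stand-in for the proto message, extended with the piggybacked counts.
// The real InspectProcessorProgress lives in jobspb; this field is an
// assumption about what the follow-on change might add.
type InspectProcessorProgress struct {
	RowCountsByIndex map[uint32]int64 // per-span counts from the hash step
}

// Coordinator-side aggregation: accumulate counts from every processor.
type coordinator struct {
	totalRowCounts map[uint32]int64
	validateCounts bool // whether row count validation is enabled
}

func (c *coordinator) onProgress(prog *InspectProcessorProgress) {
	for indexID, n := range prog.RowCountsByIndex {
		c.totalRowCounts[indexID] += n
	}
}

func (c *coordinator) finish() {
	if !c.validateCounts {
		return // counts were collected unconditionally but are ignored
	}
	for indexID, n := range c.totalRowCounts {
		fmt.Printf("index %d: %d rows total\n", indexID, n)
	}
}

func main() {
	c := &coordinator{totalRowCounts: map[uint32]int64{}, validateCounts: true}
	c.onProgress(&InspectProcessorProgress{RowCountsByIndex: map[uint32]int64{1: 10}})
	c.onProgress(&InspectProcessorProgress{RowCountsByIndex: map[uint32]int64{1: 7}})
	c.finish() // index 1: 17 rows total
}
```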
Force-pushed from f45cca8 to a87b6de.
Force-pushed from a87b6de to 91f097d.
This change exposes the row counts computed during the index consistency check for use by other checks on the span.

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
The inspect resumer needs to be able to log issues. The logger can use the `ExecutorConfig` embedded in the context without changing its behavior.

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
Force-pushed from 91f097d to 117bfad.
spilchen
left a comment
I took a quick look, although I'm not sure if this is ready for another review.
@spilchen partially reviewed 3 files, made 3 comments, and resolved 1 discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):
```go
// UpdateWithSpanProgress updates the InspectProgress with data from the
// check-specific fields in the span progress update.
func (p *InspectProgress) UpdateWithSpanProgress(prog *InspectProcessorProgress) {
```
This seems like a placeholder in this PR. I assume a follow-on PR will implement it; I'm guessing it will extract the row count. Does it need to be a first-class method of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead add a function to handle this in progress.go? Something like:

```go
func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
	// future: aggregate row counts, etc.
}
```
pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):
```go
for _, check := range checks {
	if check, ok := check.(inspectSpanCheck); ok {
		if err := check.RegisterChecksForSpan(checks); err != nil {
```
Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and not do this pre-register at all?
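A minimal sketch of the alternative being suggested (dependency validation inside CheckSpan, no pre-registration). The interfaces and names here are illustrative stand-ins, not the PR's actual types:

```go
package main

import (
	"context"
	"fmt"
)

type inspectCheck interface {
	Name() string
}

// inspectSpanCheck is a check that depends on other checks running on
// the same span. Hypothetical shape, mirroring the PR's naming.
type inspectSpanCheck interface {
	inspectCheck
	// CheckSpan receives the full set of checks for the span, so it can
	// verify its dependencies on the spot rather than in a prior
	// RegisterChecksForSpan pass.
	CheckSpan(ctx context.Context, all []inspectCheck) error
}

type rowCountCheck struct{}

func (rowCountCheck) Name() string { return "row_count" }

func (rowCountCheck) CheckSpan(ctx context.Context, all []inspectCheck) error {
	for _, c := range all {
		if c.Name() == "index_consistency" {
			return nil // dependency satisfied
		}
	}
	return fmt.Errorf("row_count requires index_consistency to run on the span")
}

func main() {
	checks := []inspectCheck{rowCountCheck{}}
	for _, c := range checks {
		if sc, ok := c.(inspectSpanCheck); ok {
			if err := sc.CheckSpan(context.Background(), checks); err != nil {
				fmt.Println(err) // dependency missing in this example
			}
		}
	}
}
```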
bghal
left a comment
@bghal made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).
pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):
Previously, spilchen wrote…
This seems like a placeholder in this PR. I assume a follow-on PR will implement it; I'm guessing it will extract the row count. Does it need to be a first-class method of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead add a function to handle this in progress.go? Something like:

```go
func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
	// future: aggregate row counts, etc.
}
```
Yep, used in #159320.
Moved the helper to progress.go.
pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):
Previously, spilchen wrote…
Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and not do this pre-register at all?
It was used to validate the checks (e.g. to ensure dependencies on other checks were satisfied), but I can move that into CheckSpan.
Force-pushed from 117bfad to 7966099.
spilchen
left a comment
I had some comments on the changes made to the coordinator, but I think this is starting to take good shape.
@spilchen partially reviewed 10 files and all commit messages, made 9 comments, and resolved 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/sql/inspect/progress.go line 260 at r6 (raw file):
```go
inspectProgress.JobCompletedCheckCount += int64(count)
// Update cached progress - the goroutine will handle actual persistence.
```
Does the goroutine have a chance to flush the cache? This happens at the very end of INSPECT, and I think the caller was doing a flush anyway. It probably makes sense to call the flush here instead.
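To make the suggestion concrete, here's a minimal self-contained sketch; progressTracker, flushProgress, and markJobChecksCompleted are stand-in names, not the actual pkg/sql/inspect API:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Stand-in for the real tracker in pkg/sql/inspect/progress.go.
type progressTracker struct {
	mu struct {
		sync.Mutex
		jobCompletedCheckCount int64
	}
}

// flushProgress persists the cached progress immediately instead of
// waiting on the background persistence goroutine.
func (t *progressTracker) flushProgress(ctx context.Context) error {
	fmt.Println("flushed count:", t.mu.jobCompletedCheckCount)
	return nil
}

func (t *progressTracker) markJobChecksCompleted(ctx context.Context, count int) error {
	t.mu.Lock()
	t.mu.jobCompletedCheckCount += int64(count)
	t.mu.Unlock()
	// This runs at the very end of INSPECT, so flush here rather than
	// relying on the goroutine to get one more tick before shutdown.
	return t.flushProgress(ctx)
}

func main() {
	t := &progressTracker{}
	_ = t.markJobChecksCompleted(context.Background(), 3)
}
```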
pkg/sql/inspect/progress.go line 458 at r6 (raw file):
```go
if checker.IsSpanLevel() {
	totalApplicableChecks++
```
Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.
pkg/sql/inspect/progress.go line 467 at r6 (raw file):
```go
}

// countApplicableSpanChecks determines how many cluster-level checks apply.
```
Nit: copy/paste error here; the function name in the comment doesn't match the actual function name.
pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):
```go
// After planning, we have the finalized set of spans to process (adjacent
// spans on the same node are merged). Compute the checks to run and initialize
// progress tracking from the plan.
```
So we skip this part (initializing progress tracking) if we resume an inspect job but all of the spans were already done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before the cluster-level check has run?
pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):
```go
// Get the inspect progress
inspectProgress, ok := progressTracker.job.Progress().Details.(*jobspb.Progress_Inspect)
```
Should we be getting the cached progress from the tracker rather than the one from storage? This may work, since I think the final progress causes an immediate write to storage, but I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?
pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):
```go
}()

// Run each cluster check
```
This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. That would also let you write unit tests to drive this logic without having to stand up a full inspect job instance.
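A sketch of what the extracted runner might look like; clusterCheck, issueLogger, and runClusterChecks are hypothetical names, and the two small interfaces are what make it unit-testable with fakes:

```go
package main

import (
	"context"
	"fmt"
)

type issue struct{ desc string }

// clusterCheck runs once on the coordinator after all spans are processed.
type clusterCheck interface {
	Run(ctx context.Context) (*issue, error)
}

type issueLogger interface {
	logIssue(ctx context.Context, iss *issue) error
}

// runClusterChecks drives each cluster-level check and logs any issue it
// finds. It depends only on the interfaces above, so tests can pass in
// fake checks and a fake logger instead of standing up a full job.
func runClusterChecks(ctx context.Context, checks []clusterCheck, logger issueLogger) error {
	for _, c := range checks {
		iss, err := c.Run(ctx)
		if err != nil {
			return fmt.Errorf("error running cluster inspect check: %w", err)
		}
		if iss == nil {
			continue // nothing to log for this check
		}
		if err := logger.logIssue(ctx, iss); err != nil {
			return err
		}
	}
	return nil
}

// Fakes standing in for real checks and the logger bundle.
type okCheck struct{}

func (okCheck) Run(ctx context.Context) (*issue, error) { return nil, nil }

type printLogger struct{}

func (printLogger) logIssue(ctx context.Context, iss *issue) error {
	fmt.Println("issue:", iss.desc)
	return nil
}

func main() {
	_ = runClusterChecks(context.Background(), []clusterCheck{okCheck{}}, printLogger{})
}
```

Note the nil-issue guard in the loop, which also addresses the question in the next comment.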
pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):
```go
	return errors.Wrapf(err, "error running cluster inspect check")
}
err = loggerBundle.logIssue(ctx, issue)
```
What if we didn't find an issue, i.e. issue is nil?
pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):
```go
log.Dev.Infof(ctx, "INSPECT resumer completed issuesFound=%t", loggerBundle.hasIssues())
if loggerBundle.hasIssues() {
```
Can this include issues that were found prior to processing the cluster-level checks? If so, I don't think we should be checking for the issues here; it's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.
Previously the checks were independent and could only process individual spans. The row count check will be dependent on other checks that perform the scan of each span and will need to aggregate the row counts after all the spans have been processed.

This change sets up interfaces for span-level checks (dependent on other checks running on the spans) and cluster-level checks (run after all spans have been processed).

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
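As a rough illustration of the two kinds of checks this commit describes (the interface and method names here are hypothetical, not the PR's actual types):

```go
package main

import "context"

// Span-level: depends on other checks running on each span and observes
// their per-span output (e.g. the row counts from the index consistency
// scan) as spans complete.
type spanLevelCheck interface {
	ObserveSpan(ctx context.Context, spanRowCount int64) error
}

// Cluster-level: runs once on the coordinator after all spans have been
// processed, operating on the aggregated results.
type clusterLevelCheck interface {
	RunCluster(ctx context.Context) error
}

func main() {}
```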
Force-pushed from 7966099 to 72404cb.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
bghal
left a comment
@bghal made 7 comments and resolved 1 discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).
pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):
Previously, spilchen wrote…
So we skip this part (initializing progress tracking) if we resume an inspect job but all of the spans were already done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before the cluster-level check has run?
Initialized the progress tracker in both branches.
pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):
Previously, spilchen wrote…
Should we be getting the cached progress from the tracker rather than the one from storage? This may work, since I think the final progress causes an immediate write to storage, but I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?
Changed to use the cache for the progress.
pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):
Previously, spilchen wrote…
This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. That would also let you write unit tests to drive this logic without having to stand up a full inspect job instance.
Did that and added tests.
pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):
Previously, spilchen wrote…
What if we didn't find an issue, i.e. issue is nil?
Done.
pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):
Previously, spilchen wrote…
Can this have issues that were found prior to processing the cluster level checks? If so, I don't think we should be checking for the issues here. It's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.
No, the logger bundle is instantiated for the resumer (that's what #160095 was for). I refactored this into the inspectLoggerBundle since both the resumer and the processors used it.
pkg/sql/inspect/progress.go line 260 at r6 (raw file):
Previously, spilchen wrote…
Does the goroutine have a chance to flush the cache? This happens at the very end of INSPECT, and I think the caller was doing a flush anyway. It probably makes sense to call the flush here instead.
Moved the flush into the function.
pkg/sql/inspect/progress.go line 458 at r6 (raw file):
Previously, spilchen wrote…
Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.
Yeah, they're counted twice since they run both as a regular check and again after all the regular checks have run.
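To make the accounting concrete, a small hypothetical illustration (the checker type is a stand-in; IsSpanLevel mirrors the snippet quoted above):

```go
package main

import "fmt"

type checker struct {
	name      string
	spanLevel bool
}

func (c checker) IsSpanLevel() bool { return c.spanLevel }

func main() {
	checkers := []checker{
		{name: "index_consistency", spanLevel: false},
		{name: "row_count", spanLevel: true},
	}
	var totalApplicableChecks int
	for _, c := range checkers {
		totalApplicableChecks++ // every check runs once during the span pass
		if c.IsSpanLevel() {
			// Span-level checks also run an aggregation step after all
			// regular checks have finished, so they count twice.
			totalApplicableChecks++
		}
	}
	fmt.Println(totalApplicableChecks) // 3
}
```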