sql: introduce span-level checks and cluster-level checks #159319
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from fb47508 to f45cca8.
spilchen
left a comment
@spilchen made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/sql/inspect/inspect_processor.go line 212 at r1 (raw file):
```go
// Run any post-checks after all spans have been processed.
if err := p.processPostChecks(ctx); err != nil {
```
The handling has to happen at the coordinator. inspect_processor.go runs on each node, so a per-node processor never has a cluster-wide view and can't do cluster-level row count validation; that has to come from the coordinator.
The idea was to piggyback on the existing check (index_consistency_check.go). That check already gathers row counts for the hash step (see buildIndexHashQuery). Right now those counts are thrown away after span processing. Instead, we should accumulate them and send them back as part of the progress stream. That means adding some state to InspectProcessorProgress.
We can do this unconditionally whenever index_consistency_check.go runs. The coordinator then decides what to do: if row count validation is enabled, it processes the aggregated counts; if not, it just ignores them. We can probably still introduce a new type of check for this (clusterChecks?) to keep the code clean.
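To make the piggyback idea concrete, here is a minimal self-contained sketch. The field and type names (RowCountsByIndex, coordinator, onProgress, finish) are hypothetical stand-ins, not the actual jobspb or inspect APIs:

```go
package main

import "fmt"

// Stand-in for the proto message, extended with the piggybacked counts.
// The real InspectProcessorProgress lives in jobspb; this field is an
// assumption about what the follow-on change might add.
type InspectProcessorProgress struct {
	RowCountsByIndex map[uint32]int64 // per-span counts from the hash step
}

// Coordinator-side aggregation: accumulate counts from every processor.
type coordinator struct {
	totalRowCounts map[uint32]int64
	validateCounts bool // whether row count validation is enabled
}

func (c *coordinator) onProgress(prog *InspectProcessorProgress) {
	for indexID, n := range prog.RowCountsByIndex {
		c.totalRowCounts[indexID] += n
	}
}

func (c *coordinator) finish() {
	if !c.validateCounts {
		return // counts were collected unconditionally but are ignored
	}
	for indexID, n := range c.totalRowCounts {
		fmt.Printf("index %d: %d rows total\n", indexID, n)
	}
}

func main() {
	c := &coordinator{totalRowCounts: map[uint32]int64{}, validateCounts: true}
	c.onProgress(&InspectProcessorProgress{RowCountsByIndex: map[uint32]int64{1: 10}})
	c.onProgress(&InspectProcessorProgress{RowCountsByIndex: map[uint32]int64{1: 7}})
	c.finish() // index 1: 17 rows total
}
```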
Force-pushed from f45cca8 to a87b6de.
Force-pushed from a87b6de to 91f097d.
This change exposes the row counts computed during the index consistency check for use by other checks on the span.

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
The inspect resumer needs to be able to log issues. The logger can use the `ExecutorConfig` embedded in the context without changing its behavior.

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
Force-pushed from 91f097d to 117bfad.
spilchen
left a comment
I took a quick look, although I'm not sure if this is ready for another review.
@spilchen partially reviewed 3 files, made 3 comments, and resolved 1 discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):
```go
// UpdateWithSpanProgress updates the InspectProgress with data from the
// check-specific fields in the span progress update.
func (p *InspectProgress) UpdateWithSpanProgress(prog *InspectProcessorProgress) {
```
This seems like a placeholder in this PR. I assume a follow-on PR will implement it; I'm guessing it will extract the row count. Does it need to be a first-class method of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead add a function to handle this in progress.go? Something like:

```go
func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
	// future: aggregate row counts, etc.
}
```
pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):
```go
for _, check := range checks {
	if check, ok := check.(inspectSpanCheck); ok {
		if err := check.RegisterChecksForSpan(checks); err != nil {
```
Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and not do this pre-register at all?
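A minimal sketch of the alternative being suggested (dependency validation inside CheckSpan, no pre-registration). The interfaces and names here are illustrative stand-ins, not the PR's actual types:

```go
package main

import (
	"context"
	"fmt"
)

type inspectCheck interface {
	Name() string
}

// inspectSpanCheck is a check that depends on other checks running on
// the same span. Hypothetical shape, mirroring the PR's naming.
type inspectSpanCheck interface {
	inspectCheck
	// CheckSpan receives the full set of checks for the span, so it can
	// verify its dependencies on the spot rather than in a prior
	// RegisterChecksForSpan pass.
	CheckSpan(ctx context.Context, all []inspectCheck) error
}

type rowCountCheck struct{}

func (rowCountCheck) Name() string { return "row_count" }

func (rowCountCheck) CheckSpan(ctx context.Context, all []inspectCheck) error {
	for _, c := range all {
		if c.Name() == "index_consistency" {
			return nil // dependency satisfied
		}
	}
	return fmt.Errorf("row_count requires index_consistency to run on the span")
}

func main() {
	checks := []inspectCheck{rowCountCheck{}}
	for _, c := range checks {
		if sc, ok := c.(inspectSpanCheck); ok {
			if err := sc.CheckSpan(context.Background(), checks); err != nil {
				fmt.Println(err) // dependency missing in this example
			}
		}
	}
}
```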
bghal
left a comment
@bghal made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).
pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):
Previously, spilchen wrote…
This seems like a placeholder in this PR. I assume a follow-on PR will implement it; I'm guessing it will extract the row count. Does it need to be a first-class method of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead add a function to handle this in progress.go? Something like:

```go
func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
	// future: aggregate row counts, etc.
}
```
Yep, used in #159320.
Moved the helper to progress.go.
pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):
Previously, spilchen wrote…
Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and not do this pre-register at all?
It was used to validate the checks (e.g. to ensure dependencies on other checks were satisfied), but I can move that into CheckSpan.
Force-pushed from 117bfad to 7966099.
spilchen
left a comment
I had some comments on the changes made to the coordinator, but I think this is starting to take good shape.
@spilchen partially reviewed 10 files and all commit messages, made 9 comments, and resolved 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).
pkg/sql/inspect/progress.go line 260 at r6 (raw file):
```go
inspectProgress.JobCompletedCheckCount += int64(count)
// Update cached progress - the goroutine will handle actual persistence.
```
Does the goroutine have a chance to flush the cache? This happens at the very end of INSPECT, and I think the caller was doing a flush anyway. It probably makes sense to call the flush here instead.
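To make the suggestion concrete, here's a minimal self-contained sketch; progressTracker, flushProgress, and markJobChecksCompleted are stand-in names, not the actual pkg/sql/inspect API:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Stand-in for the real tracker in pkg/sql/inspect/progress.go.
type progressTracker struct {
	mu struct {
		sync.Mutex
		jobCompletedCheckCount int64
	}
}

// flushProgress persists the cached progress immediately instead of
// waiting on the background persistence goroutine.
func (t *progressTracker) flushProgress(ctx context.Context) error {
	fmt.Println("flushed count:", t.mu.jobCompletedCheckCount)
	return nil
}

func (t *progressTracker) markJobChecksCompleted(ctx context.Context, count int) error {
	t.mu.Lock()
	t.mu.jobCompletedCheckCount += int64(count)
	t.mu.Unlock()
	// This runs at the very end of INSPECT, so flush here rather than
	// relying on the goroutine to get one more tick before shutdown.
	return t.flushProgress(ctx)
}

func main() {
	t := &progressTracker{}
	_ = t.markJobChecksCompleted(context.Background(), 3)
}
```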
pkg/sql/inspect/progress.go line 458 at r6 (raw file):
```go
if checker.IsSpanLevel() {
	totalApplicableChecks++
```
Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.
pkg/sql/inspect/progress.go line 467 at r6 (raw file):
```go
}

// countApplicableSpanChecks determines how many cluster-level checks apply.
```
Nit: copy/paste error here; the function name in the comment doesn't match the actual function name.
pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):
```go
// After planning, we have the finalized set of spans to process (adjacent
// spans on the same node are merged). Compute the checks to run and initialize
// progress tracking from the plan.
```
So we skip this part (initializing progress tracking) if we resume an inspect job but all of the spans were already done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before the cluster-level check has run?
pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):
```go
// Get the inspect progress
inspectProgress, ok := progressTracker.job.Progress().Details.(*jobspb.Progress_Inspect)
```
Should we be getting the cached progress from the tracker rather than the one from storage? This may work, since I think the final progress causes an immediate write to storage, but I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?
pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):
```go
}()

// Run each cluster check
```
This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. That would also let you write unit tests to drive this logic without having to stand up a full inspect job instance.
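A sketch of what the extracted runner might look like; clusterCheck, issueLogger, and runClusterChecks are hypothetical names, and the two small interfaces are what make it unit-testable with fakes:

```go
package main

import (
	"context"
	"fmt"
)

type issue struct{ desc string }

// clusterCheck runs once on the coordinator after all spans are processed.
type clusterCheck interface {
	Run(ctx context.Context) (*issue, error)
}

type issueLogger interface {
	logIssue(ctx context.Context, iss *issue) error
}

// runClusterChecks drives each cluster-level check and logs any issue it
// finds. It depends only on the interfaces above, so tests can pass in
// fake checks and a fake logger instead of standing up a full job.
func runClusterChecks(ctx context.Context, checks []clusterCheck, logger issueLogger) error {
	for _, c := range checks {
		iss, err := c.Run(ctx)
		if err != nil {
			return fmt.Errorf("error running cluster inspect check: %w", err)
		}
		if iss == nil {
			continue // nothing to log for this check
		}
		if err := logger.logIssue(ctx, iss); err != nil {
			return err
		}
	}
	return nil
}

// Fakes standing in for real checks and the logger bundle.
type okCheck struct{}

func (okCheck) Run(ctx context.Context) (*issue, error) { return nil, nil }

type printLogger struct{}

func (printLogger) logIssue(ctx context.Context, iss *issue) error {
	fmt.Println("issue:", iss.desc)
	return nil
}

func main() {
	_ = runClusterChecks(context.Background(), []clusterCheck{okCheck{}}, printLogger{})
}
```

Note the nil-issue guard in the loop, which also addresses the question in the next comment.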
pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):
```go
	return errors.Wrapf(err, "error running cluster inspect check")
}
err = loggerBundle.logIssue(ctx, issue)
```
What if we didn't find an issue, i.e. issue is nil?
pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):
```go
log.Dev.Infof(ctx, "INSPECT resumer completed issuesFound=%t", loggerBundle.hasIssues())
if loggerBundle.hasIssues() {
```
Can this include issues that were found prior to processing the cluster-level checks? If so, I don't think we should be checking for the issues here; it's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.
Previously the checks were independent and could only process individual spans. The row count check will be dependent on other checks that perform the scan of each span and will need to aggregate the row counts after all the spans have been processed.

This change sets up interfaces for span-level checks (dependent on other checks running on the spans) and cluster-level checks (run after all spans have been processed).

Part of: cockroachdb#155472
Epic: CRDB-55075
Release note: None
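As a rough illustration of the two kinds of checks this commit describes (the interface and method names here are hypothetical, not the PR's actual types):

```go
package main

import "context"

// Span-level: depends on other checks running on each span and observes
// their per-span output (e.g. the row counts from the index consistency
// scan) as spans complete.
type spanLevelCheck interface {
	ObserveSpan(ctx context.Context, spanRowCount int64) error
}

// Cluster-level: runs once on the coordinator after all spans have been
// processed, operating on the aggregated results.
type clusterLevelCheck interface {
	RunCluster(ctx context.Context) error
}

func main() {}
```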
Force-pushed from 7966099 to 72404cb.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
bghal
left a comment
@bghal made 7 comments and resolved 1 discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).
pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):
Previously, spilchen wrote…
So we skip this part (initializing progress tracking) if we resume an inspect job but all of the spans were already done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before the cluster-level check has run?
Initialized the progress tracker in both branches.
pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):
Previously, spilchen wrote…
Should we be getting the cached progress from the tracker rather than the one from storage? This may work, since I think the final progress causes an immediate write to storage, but I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?
Changed to use the cache for the progress.
pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):
Previously, spilchen wrote…
This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. That would also let you write unit tests to drive this logic without having to stand up a full inspect job instance.
Did that and added tests.
pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):
Previously, spilchen wrote…
What if we didn't find an issue, i.e. issue is nil?
Done.
pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):
Previously, spilchen wrote…
Can this have issues that were found prior to processing the cluster level checks? If so, I don't think we should be checking for the issues here. It's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.
No, the logger bundle is instantiated for the resumer (that's what #160095 was for). I refactored this into the inspectLoggerBundle since both the resumer and the processors used it.
pkg/sql/inspect/progress.go line 260 at r6 (raw file):
Previously, spilchen wrote…
Does the goroutine have a chance to flush the cache? This happens at the very end of INSPECT, and I think the caller was doing a flush anyway. It probably makes sense to call the flush here instead.
Moved the flush into the function.
pkg/sql/inspect/progress.go line 458 at r6 (raw file):
Previously, spilchen wrote…
Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.
Yeah, they're counted twice since they run both as a regular check and again after all the regular checks have run.
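To make the accounting concrete, a small hypothetical illustration (the checker type is a stand-in; IsSpanLevel mirrors the snippet quoted above):

```go
package main

import "fmt"

type checker struct {
	name      string
	spanLevel bool
}

func (c checker) IsSpanLevel() bool { return c.spanLevel }

func main() {
	checkers := []checker{
		{name: "index_consistency", spanLevel: false},
		{name: "row_count", spanLevel: true},
	}
	var totalApplicableChecks int
	for _, c := range checkers {
		totalApplicableChecks++ // every check runs once during the span pass
		if c.IsSpanLevel() {
			// Span-level checks also run an aggregation step after all
			// regular checks have finished, so they count twice.
			totalApplicableChecks++
		}
	}
	fmt.Println(totalApplicableChecks) // 3
}
```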