Conversation

Contributor

@bghal bghal commented Dec 11, 2025

Previously, the checks were independent and could only process individual spans. The row count check will depend on other checks that perform the scan of each span, and it will need to aggregate the row counts after all the spans have been processed. This change sets up interfaces for span-level checks (dependent on other checks running on the spans) and cluster-level checks (run after all spans have been processed).

Part of: #155472
Epic: CRDB-55075

Release note: None
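
For orientation, here is a rough sketch of the two kinds of checks this introduces. The type and method names below are illustrative stand-ins, not the PR's actual identifiers (the review refers to real names such as inspectSpanCheck, CheckSpan, and IsSpanLevel).

```go
import "context"

// span stands in for a key span; placeholder for illustration only.
type span struct{ key, endKey []byte }

// check is a regular per-span check that scans its span independently.
type check interface {
	processSpan(ctx context.Context, sp span) error
}

// spanLevelCheck depends on other checks having scanned the same span,
// e.g. a row-count check that reuses counts gathered by the index
// consistency check.
type spanLevelCheck interface {
	checkSpan(ctx context.Context, sp span, others []check) error
}

// clusterLevelCheck runs once, after every span in the job has been
// processed, typically on the coordinator.
type clusterLevelCheck interface {
	checkCluster(ctx context.Context) error
}
```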

blathers-crl bot commented Dec 11, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@bghal bghal force-pushed the inspect-row-count branch 3 times, most recently from fb47508 to f45cca8 on December 17, 2025 at 20:14
@bghal bghal marked this pull request as ready for review December 18, 2025 18:12
@bghal bghal requested review from a team as code owners December 18, 2025 18:12
@bghal bghal requested review from kev-cao and removed request for a team December 18, 2025 18:12
Contributor

@spilchen spilchen left a comment

@spilchen made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).


pkg/sql/inspect/inspect_processor.go line 212 at r1 (raw file):

	// Run any post-checks after all spans have been processed.
	if err := p.processPostChecks(ctx); err != nil {

The handling has to happen at the coordinator. We don't have a cluster-wide view if we try to process the meta checks inside the per-node inspect processor: inspect_processor.go runs on each node, so it can't do cluster-level row count validation.

The idea was to piggyback on the existing check (index_consistency_check.go). That check already gathers row counts for the hash step (see buildIndexHashQuery). Right now those counts are thrown away after span processing. Instead, we should accumulate them and send them back as part of the progress stream. That means adding some state to InspectProcessorProgress.

We can do this unconditionally whenever index_consistency_check.go runs. The coordinator then decides what to do: if row count validation is enabled, it processes the aggregated counts; if not, it just ignores them. We can probably still introduce a new type of check for this (clusterChecks?) to keep the code clean.
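
A minimal sketch of that flow, assuming a hypothetical rowCountsByIndex field on the processor progress and a hypothetical aggregation helper on the coordinator; only InspectProcessorProgress and index_consistency_check.go are real names from this discussion.

```go
// Sketch only: the per-node processor attaches the row counts it already
// computes for the hash step to its progress update, and the coordinator
// aggregates them. All names below are hypothetical.
type processorProgress struct {
	finishedSpans    int
	rowCountsByIndex map[string]int64 // e.g. "table/index" -> rows seen in this span
}

type coordinator struct {
	totalRowsByIndex map[string]int64
}

// aggregate is called for every processor progress update, unconditionally.
func (c *coordinator) aggregate(p processorProgress) {
	if c.totalRowsByIndex == nil {
		c.totalRowsByIndex = make(map[string]int64)
	}
	for idx, n := range p.rowCountsByIndex {
		c.totalRowsByIndex[idx] += n
	}
}

// validateRowCounts runs only when row-count validation is enabled; otherwise
// the aggregated counts are simply ignored.
func (c *coordinator) validateRowCounts(enabled bool) error {
	if !enabled {
		return nil
	}
	// future: compare secondary-index totals against the primary index total.
	return nil
}
```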

@bghal bghal marked this pull request as draft December 23, 2025 19:43
@bghal bghal changed the title from "sql: add support for meta-checks and post-checks" to "sql: introduce span-level checks and cluster-level checks" on Dec 23, 2025
@bghal bghal added the backport-26.1.x label (Flags PRs that need to be backported to 26.1) on Dec 31, 2025
@bghal bghal force-pushed the inspect-row-count branch from f45cca8 to a87b6de on December 31, 2025 at 02:28
@bghal bghal marked this pull request as ready for review December 31, 2025 13:51
@bghal bghal requested a review from spilchen December 31, 2025 13:51
@bghal bghal force-pushed the inspect-row-count branch from a87b6de to 91f097d on January 2, 2026 at 23:00

github-actions bot commented Jan 2, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected label (AI reviewer found potential issue. Never assign manually; auto-applied by GH action only.) on Jan 2, 2026
bghal added 2 commits January 6, 2026 13:15
This change exposes the row counts computed during the index consistency
check for use by other checks on the span.

Part of: cockroachdb#155472
Epic: CRDB-55075

Release note: None
The inspect resumer needs to be able to log issues. The logger can use
the `ExecutorConfig` embedded in the context without changing its
behavior.

Part of: cockroachdb#155472
Epic: CRDB-55075

Release note: None
@bghal bghal force-pushed the inspect-row-count branch from 91f097d to 117bfad on January 6, 2026 at 19:34
Contributor

@spilchen spilchen left a comment

I took a quick look, although I'm not sure if this is ready for another review.

@spilchen partially reviewed 3 files, made 3 comments, and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).


pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):

// UpdateWithSpanProgress updates the InspectProgress with data from the
// check-specific fields in the span progress update.
func (p *InspectProgress) UpdateWithSpanProgress(prog *InspectProcessorProgress) {

This seems like a placeholder in this PR. I assume a follow-on PR will implement this, which I'm guessing will extract the row count. Does it need to be a first-class function of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead just add a function to handle this in progress.go? Something like:

func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
      // future: aggregate row counts, etc.
}

pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):

	for _, check := range checks {
		if check, ok := check.(inspectSpanCheck); ok {
			if err := check.RegisterChecksForSpan(checks); err != nil {

Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and skip this pre-registration entirely?

Contributor Author

@bghal bghal left a comment

@bghal made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).


pkg/jobs/jobspb/jobs.go line 182 at r4 (raw file):

Previously, spilchen wrote…

This seems like a placeholder in this PR. I assume a follow-on PR will implement this, which I'm guessing will extract the row count. Does it need to be a first-class function of InspectProgress? I'm not a huge fan of adding it to the jobspb package, which is typically reserved for general-purpose helpers; this seems too specific to the inspect progress. Could we instead just add a function to handle this in progress.go? Something like:

func updateProgressWithSpanData(p *jobspb.InspectProgress, prog *jobspb.InspectProcessorProgress) {
      // future: aggregate row counts, etc.
}

Yep, used in #159320.

Moved the helper to progress.go.


pkg/sql/inspect/inspect_processor.go line 292 at r4 (raw file):

Previously, spilchen wrote…

Why do we need to register each check beforehand? Could we not pass the check into CheckSpan and skip this pre-registration entirely?

It was used to validate the checks (e.g. to ensure dependencies on other checks were satisfied), but I can move that into CheckSpan.
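
Roughly what that could look like, with all type names invented for illustration; only the placement of the validation inside the span-level check is the point here.

```go
import (
	"context"
	"errors"
)

// Hypothetical stand-ins for the real check types.
type check interface {
	name() string
}

type indexConsistencyCheck struct{}

func (*indexConsistencyCheck) name() string { return "index_consistency" }

type rowCountCheck struct{}

func (*rowCountCheck) name() string { return "row_count" }

// checkSpan validates its dependency at the point of use instead of in a
// separate registration pass, failing clearly if the dependency is missing.
func (*rowCountCheck) checkSpan(ctx context.Context, others []check) error {
	for _, o := range others {
		if _, ok := o.(*indexConsistencyCheck); ok {
			return nil // dependency satisfied; its row counts can be consumed here
		}
	}
	return errors.New("row count check requires the index consistency check on this span")
}
```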

@bghal bghal force-pushed the inspect-row-count branch from 117bfad to 7966099 on January 6, 2026 at 23:16

github-actions bot commented Jan 6, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

Contributor

@spilchen spilchen left a comment

I had some comments on the changes made to the coordinator. I think this is starting to take good shape, though.

@spilchen partially reviewed 10 files and all commit messages, made 9 comments, and resolved 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bghal, @dt, and @kev-cao).


pkg/sql/inspect/progress.go line 260 at r6 (raw file):

	inspectProgress.JobCompletedCheckCount += int64(count)

	// Update cached progress - the goroutine will handle actual persistence.

Does the goroutine have a chance to flush the cache? This happens at the very end of the inspect job. I think the caller was doing a flush anyway; it probably makes sense to call the flush here instead.
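
Something like the following, sketched with hypothetical names (the real tracker and its flush mechanics may differ):

```go
import (
	"context"
	"sync"
)

// progressTracker is a stripped-down stand-in for the real tracker.
type progressTracker struct {
	mu     sync.Mutex
	cached struct{ JobCompletedCheckCount int64 }
	flush  func(context.Context) error // persists the cached progress
}

// markChecksCompleted bumps the completed-check count and flushes the cache
// immediately. Since this runs at the very end of the job, it does not rely
// on the background goroutine getting another chance to persist.
func (t *progressTracker) markChecksCompleted(ctx context.Context, count int) error {
	t.mu.Lock()
	t.cached.JobCompletedCheckCount += int64(count)
	t.mu.Unlock()
	return t.flush(ctx)
}
```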


pkg/sql/inspect/progress.go line 458 at r6 (raw file):

				if checker.IsSpanLevel() {
					totalApplicableChecks++

Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.


pkg/sql/inspect/progress.go line 467 at r6 (raw file):

}

// countApplicableSpanChecks determines how many cluster-level checks apply.

nit: copy/paste error here, the function name in the comment doesn't match the actual function name.


pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):

		// After planning, we have the finalized set of spans to process (adjacent
		// spans on the same node are merged). Compute the checks to run and initialize
		// progress tracking from the plan.

So we skip this part (initializing progress tracking) if we resume an inspect but all of the spans were done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before we have done the cluster-level check?


pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):

	// Get the inspect progress
	inspectProgress, ok := progressTracker.job.Progress().Details.(*jobspb.Progress_Inspect)

should we be getting the cached progress from the tracker rather than the one from storage? This may work since I think the final progress causes an immediate write to storage. But I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?


pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):

	}()

	// Run each cluster check

This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. It would also allow you to write unit tests to drive this logic without having to stand up a full inspect job instance.
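
A sketch of what such an extracted runner could look like; the types and names here are invented, apart from the logIssue pattern visible in the diff below.

```go
import "context"

type inspectIssue struct{ description string }

// clusterCheck is any check that runs once after all spans are processed.
type clusterCheck interface {
	checkCluster(ctx context.Context) (*inspectIssue, error)
}

type issueLogger interface {
	logIssue(ctx context.Context, issue *inspectIssue) error
}

// runClusterChecks contains no job or resumer plumbing, so it can be unit
// tested with fake checks and a fake logger.
func runClusterChecks(ctx context.Context, checks []clusterCheck, logger issueLogger) error {
	for _, c := range checks {
		issue, err := c.checkCluster(ctx)
		if err != nil {
			return err
		}
		if issue == nil {
			continue // this check found nothing to report
		}
		if err := logger.logIssue(ctx, issue); err != nil {
			return err
		}
	}
	return nil
}
```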


pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):

					return errors.Wrapf(err, "error running cluster inspect check")
				}
				err = loggerBundle.logIssue(ctx, issue)

What if we didn't find an issue, i.e. issue is nil?


pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):

	log.Dev.Infof(ctx, "INSPECT resumer completed issuesFound=%t", loggerBundle.hasIssues())
	if loggerBundle.hasIssues() {

Can this have issues that were found prior to processing the cluster-level checks? If so, I don't think we should be checking for the issues here; it's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.

Previously, the checks were independent and could only process individual spans. The row count check will depend on other checks that perform the scan of each span, and it will need to aggregate the row counts after all the spans have been processed. This change sets up interfaces for span-level checks (dependent on other checks running on the spans) and cluster-level checks (run after all spans have been processed).

Part of: cockroachdb#155472
Epic: CRDB-55075

Release note: None
@bghal bghal force-pushed the inspect-row-count branch from 7966099 to 72404cb on January 8, 2026 at 19:50

blathers-crl bot commented Jan 8, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Contributor Author

@bghal bghal left a comment

@bghal made 7 comments and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @kev-cao, and @spilchen).


pkg/sql/inspect/inspect_resumer.go line 86 at r6 (raw file):

Previously, spilchen wrote…

So we skip this part (initializing progress tracking) if we resume an inspect but all of the spans were done. Does this cause problems later in processClusterChecks because the progress cache isn't set up? Do we have a test that pauses (then resumes) the inspect job after completing the last span but before we have done the cluster-level check?

Initialized the progress tracker in both branches.


pkg/sql/inspect/inspect_resumer.go line 448 at r6 (raw file):

Previously, spilchen wrote…

should we be getting the cached progress from the tracker rather than the one from storage? This may work since I think the final progress causes an immediate write to storage. But I would think it's more efficient to use the cache anyway. Or maybe it's done this way to handle restarts?

Changed to use the cache for the progress.


pkg/sql/inspect/inspect_resumer.go line 464 at r6 (raw file):

Previously, spilchen wrote…

This runner loop is similar to the one in pkg/sql/inspect/runner.go, except this one works on cluster-level checks. I think it would make sense to move the bulk of this function to a separate file (cluster_runner.go?) to keep the same level of abstraction. It would also allow you to write unit tests to drive this logic without having to stand up a full inspect job instance.

Did that and added tests.


pkg/sql/inspect/inspect_resumer.go line 479 at r6 (raw file):

Previously, spilchen wrote…

What if we didn't find an issue, i.e. issue is nil?

Done.


pkg/sql/inspect/inspect_resumer.go line 495 at r6 (raw file):

Previously, spilchen wrote…

Can this have issues that were found prior to processing the cluster-level checks? If so, I don't think we should be checking for the issues here; it's not the correct layer of abstraction. I think it would make more sense to let the caller deal with this.

No, the logger bundle is instantiated for the resumer (that's what #160095 was for). I refactored this into the inspectLoggerBundle since both the resumer and the processors used it.


pkg/sql/inspect/progress.go line 260 at r6 (raw file):

Previously, spilchen wrote…

Does the goroutine have a chance to flush the cache? This happens at the very end of the inspect job. I think the caller was doing a flush anyway; it probably makes sense to call the flush here instead.

Moved the flush into the function.


pkg/sql/inspect/progress.go line 458 at r6 (raw file):

Previously, spilchen wrote…

Is this correct? Doesn't this cause us to double-count span-level checks? If it is correct, we should have a comment here explaining why, since it seems odd.

Yeah, they're counted twice since they run both as a regular check and again after all the regular checks have run.


github-actions bot commented Jan 8, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful
