
RHOAIENG-49727 | feat: Adding e2e test circuit breaker#3215

Open
den-rgb wants to merge 1 commit into opendatahub-io:main from den-rgb:RHOAIENG-49727

Conversation


den-rgb (Contributor) commented Mar 4, 2026

Description

Adds a circuit breaker that halts test execution early when infrastructure failures are detected, instead of letting every test time out individually against a broken cluster.

The circuit breaker operates at two levels:

  • Pre-flight check - Before any tests run, TestOdhOperator runs a cluster health check. If the cluster is already unhealthy (operator down, nodes not ready), the breaker trips immediately and all suites are skipped. This catches the most common case: tests launched against a broken environment.
  • Sequential detection - Between test suites, mustRun records pass/fail results. After N consecutive failures (configurable, default 3), a health check runs. If the cluster is unhealthy, the breaker trips and the remaining suites are skipped. If it is healthy, the counter resets (the failures are attributed to test logic, not infrastructure). Skipped test groups and parallel tests don't record results, to avoid false counter resets.
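The sequential detection described above can be sketched as a minimal, concurrency-safe breaker. This is a simplified illustration only: `circuitBreaker`, `RecordResult`, and `IsOpen` are hypothetical stand-ins for the PR's actual types, and the health check is stubbed out.

```go
package main

import (
	"fmt"
	"sync"
)

// healthCheckFunc reports whether the cluster looks healthy.
// In the real suite this would wrap the cluster health checker.
type healthCheckFunc func() bool

// circuitBreaker trips after `threshold` consecutive failures
// that coincide with an unhealthy cluster.
type circuitBreaker struct {
	mu                  sync.Mutex
	threshold           int
	consecutiveFailures int
	open                bool
	tripReason          string
	checkHealth         healthCheckFunc
}

// RecordResult feeds a suite outcome into the breaker. On the Nth
// consecutive failure it runs a health check; an unhealthy cluster
// trips the breaker, while a healthy one resets the counter (the
// failures are attributed to test logic, not infrastructure).
func (cb *circuitBreaker) RecordResult(passed bool) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if passed {
		cb.consecutiveFailures = 0
		return
	}
	cb.consecutiveFailures++
	if cb.consecutiveFailures >= cb.threshold {
		if !cb.checkHealth() {
			cb.open = true
			cb.tripReason = fmt.Sprintf("%d consecutive failures on an unhealthy cluster", cb.consecutiveFailures)
		} else {
			cb.consecutiveFailures = 0
		}
	}
}

// IsOpen reports whether remaining suites should be skipped.
func (cb *circuitBreaker) IsOpen() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	return cb.open
}

func main() {
	// Simulate three failed suites against a cluster whose health check fails.
	cb := &circuitBreaker{threshold: 3, checkHealth: func() bool { return false }}
	for i := 0; i < 3; i++ {
		cb.RecordResult(false)
	}
	fmt.Println(cb.IsOpen()) // true: the breaker tripped at the threshold
}
```

Note that this sketch runs the health check while holding the mutex, which sidesteps the recheck-after-unlock race discussed in the review below at the cost of blocking other recorders during the check.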

How Has This Been Tested?

# Circuit breaker trips at the pre-flight check
kubectl scale deployment opendatahub-operator-controller-manager \
  -n opendatahub-operator-system --replicas=0

go test -v ./tests/e2e/... \
  --test-services=true \
  --test-service=monitoring \
  --test-components=false \
  --test-operator-controller=false \
  --deletion-policy=on-failure \
  -timeout=10m
# Sequential detection - should trip while tests are running
E2E_TEST_DEFAULTEVENTUALLYTIMEOUT=15s \
E2E_TEST_SHORTEVENTUALLYTIMEOUT=5s \
E2E_TEST_MEDIUMEVENTUALLYTIMEOUT=15s \
E2E_TEST_LONGEVENTUALLYTIMEOUT=15s \
go test -v ./tests/e2e/... \
  --test-services=true \
  --test-service=monitoring \
  --test-components=false \
  --test-operator-controller=false \
  --fail-fast-on-error=false \
  --circuit-breaker-threshold=1 \
  --deletion-policy=on-failure \
  -timeout=10m
  
# In a different terminal
sleep 25 && kubectl scale deployment opendatahub-operator-controller-manager \
  -n opendatahub-operator-system --replicas=0


Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • The developer has run the integration test pipeline and verified that it passed successfully.

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Inspect the opt-out guidelines to determine if the nature of the PR changes allows for skipping this requirement.
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below.
  3. Check the checkbox below:
     • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Summary by CodeRabbit

  • Tests
    • Added a configurable circuit breaker for end-to-end tests (with CLI/env toggles) that performs preflight cluster health checks, auto-skips tests when tripped, and emits summary diagnostics
    • Added failure classification to distinguish infrastructure vs. test-logic failures
    • Improved test reporting and result-tracking to reduce noise during cluster issues


coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an end-to-end circuit breaker and related test utilities to the e2e test suite. New components: a concurrency-safe CircuitBreaker (state, threshold, trip reasons, logging, skip helpers), a ClusterHealthChecker that wraps clusterhealth.Run and returns ClusterHealthStatus, and a FailureCategory classifier that scores failure output against infra/test-logic patterns. Integrates the breaker into the test runner with new flags (circuit-breaker, circuit-breaker-threshold), records test outcomes, skips tests when the breaker is open, and exposes diagnostics and force-trip APIs.
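The FailureCategory classifier described in the walkthrough could look roughly like the sketch below. The pattern lists, names, and the 64 KiB cap are invented for illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureCategory distinguishes infrastructure failures from test-logic failures.
type FailureCategory string

const (
	CategoryInfrastructure FailureCategory = "infrastructure"
	CategoryTestLogic      FailureCategory = "test-logic"
	CategoryUnknown        FailureCategory = "unknown"
)

// Illustrative pattern lists, not the suite's actual ones.
var infraPatterns = []string{"connection refused", "no such host", "etcdserver", "context deadline exceeded"}
var testLogicPatterns = []string{"expected", "assertion failed", "to equal"}

// classifyFailure scores output against the two pattern lists; the higher
// score wins. Capping the inspected output keeps the cost predictable even
// for very large test logs.
func classifyFailure(output string) FailureCategory {
	const maxInspect = 64 * 1024 // bound the work done on huge logs
	if len(output) > maxInspect {
		output = output[:maxInspect]
	}
	output = strings.ToLower(output)
	score := func(patterns []string) int {
		n := 0
		for _, p := range patterns {
			n += strings.Count(output, p)
		}
		return n
	}
	infra, logic := score(infraPatterns), score(testLogicPatterns)
	switch {
	case infra > logic:
		return CategoryInfrastructure
	case logic > infra:
		return CategoryTestLogic
	default:
		return CategoryUnknown
	}
}

func main() {
	fmt.Println(classifyFailure("dial tcp 10.0.0.1:6443: connection refused")) // infrastructure
}
```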

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Critical Issues

  • Race on post-health-check state update (CWE-362: Concurrent Modification). The breaker performs health checks outside the mutex and then "rechecks" state; ensure the recheck and any state mutation are done while holding the same mutex to prevent lost updates and TOCTOU. Action: acquire lock before applying state transitions after health check or use versioning/compare-and-swap under lock.

  • Missing/unclear context cancellation for cluster health run (CWE-400: Uncontrolled Resource Consumption / availability). Ensure ClusterHealthChecker.Check uses a context with timeout/cancel passed into clusterhealth.Run so hung health probes cannot block test shutdown. Action: create and pass a cancellable context with deadline and handle context errors explicitly.

  • Unbounded pattern-matching complexity in ClassifyFailure (CWE-400). The classifier counts matches over unbounded pattern lists and entire output; a large output or long pattern lists can cause high CPU work. Action: bound pattern list sizes, limit inspected output length, and short-circuit after decisive score thresholds.

  • No public reset/recovery API for CircuitBreaker (availability/operational). Only ForceTrip exists; there's no documented safe way to reset after infra fixes, forcing test suite restarts to recover. Action: add a Reset/Close method protected by policy checks and update tests/cleanup to call it when appropriate.

  • Parallel-test recording gap / missed failures (reliability). Parallel tests are excluded from deterministic recording; failures in parallel tests may be missed by the breaker. Action: document this explicitly and either (a) record results from parallel tests via a concurrency-safe aggregator, or (b) mark parallel runs as out-of-scope and ensure callers accept reduced coverage.

  • Trip reason string origin and sanitization (info disclosure/log injection). Trip reasons derived from external error messages may include untrusted content. Action: sanitize or normalize trip-reason strings before emitting logs or including them in test-skip messages to avoid log-injection or leaked sensitive info.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title accurately summarizes the main change: introducing a circuit breaker feature for end-to-end tests to prevent cascading failures on degraded infrastructure)


@openshift-ci openshift-ci bot requested review from cgoodfred and sefroberg March 4, 2026 11:17

openshift-ci bot commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lphiri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@den-rgb den-rgb force-pushed the RHOAIENG-49727 branch 2 times, most recently from 9197106 to 8b48984 Compare March 4, 2026 11:27
@den-rgb den-rgb force-pushed the RHOAIENG-49727 branch 5 times, most recently from e6dca1e to 4794521 Compare March 4, 2026 11:41

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
pkg/clusterhealth/operator.go (1)

94-97: Prefer using the deployment’s selector over hardcoded label pairs.

Category: Suggestions. This avoids selector drift across operator variants and keeps pod discovery aligned with the actual deployment spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/clusterhealth/operator.go` around lines 94 - 97, Replace the hardcoded
labelSelectors slice (labelSelectors := []map[string]string{...}) with logic
that retrieves the target Deployment's selector and uses that selector to
discover pods; specifically locate where labelSelectors is defined in
operator.go and instead fetch the Deployment resource (e.g., the
controller-manager or rhods-operator Deployment) and use its
.Spec.Selector.MatchLabels (or converted metav1.LabelSelector) as the label map
for pod listing so discovery follows the actual Deployment.Spec.Selector rather
than fixed label pairs.
tests/e2e/controller_test.go (1)

576-577: Make parallel explicit instead of deriving it from len(opts).

Category: Suggestions. len(opts) > 0 will misclassify future non-parallel options and can suppress result recording unintentionally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 576 - 577, The current code infers
parallelism with "parallel := len(opts) > 0", which is fragile; change to use an
explicit boolean flag instead (e.g., add or read a field like opts.Parallel or
accept an explicit "parallel" parameter) so parallel is not tied to presence of
other options; update the variable assignment where "parallel" is declared to
read that explicit flag and adjust any call sites that construct "opts" to set
the flag, keeping "testExecuted" behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/clusterhealth/nodes.go`:
- Around line 46-53: The code currently overwrites info.UnhealthyReason for each
failing condition; update the logic in the node condition processing (the
branches that build "reason" using node.Name, cond.Type and cond.Message) to
preserve all reasons by appending new reasons instead of replacing them—e.g., if
info.UnhealthyReason is empty set it to the first reason, otherwise concatenate
the new reason with a clear separator (comma, semicolon, or newline). Make this
change in both the NodeReady branch and the problemConditions branch so
info.UnhealthyReason accumulates every reason while keeping the existing issues
= append(issues, reason) behavior.

In `@pkg/clusterhealth/operator.go`:
- Around line 102-108: The loop over labelSelectors currently swallows errors
from c.List (in operator.go) which can hide pod/container failures; update the
logic in the loop that calls c.List(ctx, pods, client.InNamespace(namespace),
client.MatchingLabels(labels)) to surface failures instead of continuing
silently—either aggregate the error and mark the operator/health check as
degraded, log the error with context (including the selector and namespace), or
return the error up to the caller so the operator health reflects the failure;
modify the pod-list handling around c.List, pods (*corev1.PodList) and the
surrounding health reporting code to ensure any c.List error becomes a visible
health issue rather than being ignored.

In `@tests/e2e/circuit_breaker_test.go`:
- Around line 104-105: The test calls cb.healthChecker.Check() without verifying
healthChecker is non-nil; guard against a nil checker in the test by checking
cb.healthChecker != nil before invoking Check() (or by initializing
cb.healthChecker with a mock before the call). Update the test around the call
to cb.healthChecker.Check() to either assert cb.healthChecker is not nil and
then call Check(), or set cb.healthChecker to a valid HealthChecker
implementation so cb.healthChecker.Check() cannot panic.

In `@tests/e2e/cluster_health_test.go`:
- Around line 92-96: The test currently calls clusterhealth.Run with
context.TODO(), which can hang; change it to use a bounded context created via
context.WithTimeout (or WithDeadline) and pass that ctx into clusterhealth.Run,
e.g., create ctx, cancel := context.WithTimeout(context.Background(),
<reasonable timeout like 10s>); defer cancel(); then call clusterhealth.Run(ctx,
cfg). Ensure the cancel is deferred so the timer is cleaned up and keep the
existing error handling (setting status.Healthy=false and appending the issue)
unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f6ac9cc2-804d-4fbf-8da0-fe3f46bba367

📥 Commits

Reviewing files that changed from the base of the PR and between 4702b77 and d02de1a.

📒 Files selected for processing (7)
  • pkg/clusterhealth/nodes.go
  • pkg/clusterhealth/operator.go
  • tests/e2e/circuit_breaker_test.go
  • tests/e2e/cluster_health_test.go
  • tests/e2e/controller_test.go
  • tests/e2e/failure_classifier_test.go
  • tests/e2e/helper_test.go


codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.65%. Comparing base (4702b77) to head (e8dc0fd).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3215   +/-   ##
=======================================
  Coverage   49.65%   49.65%           
=======================================
  Files         165      165           
  Lines       12314    12314           
=======================================
  Hits         6115     6115           
  Misses       5551     5551           
  Partials      648      648           

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/e2e/controller_test.go (1)

576-576: Category: Code Quality — Fragile parallel detection assumes any option implies parallel.

parallel := len(opts) > 0 assumes all options mean parallel execution. If a non-parallel option is added in the future, this will incorrectly skip recording results.

Consider explicitly checking for the parallel option:

Suggested improvement
-	parallel := len(opts) > 0
+	parallel := false
+	for _, opt := range opts {
+		// Check if this is the parallel option by comparing function pointers
+		// or use a dedicated flag in TestCaseOpts
+	}

Alternatively, add a comment clarifying the assumption, or refactor TestCaseOpts to include a flag indicating parallel execution.
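A minimal sketch of the explicit-flag approach suggested above. `TestCaseOpts` is redefined here in simplified, illustrative form and `WithParallel`/`buildOpts` are hypothetical names, not the suite's actual definitions.

```go
package main

import "fmt"

// TestCaseOpts collects per-test settings; an explicit Parallel field
// replaces inferring parallelism from the number of options passed.
type TestCaseOpts struct {
	Parallel bool
}

// Option is a standard functional option over TestCaseOpts.
type Option func(*TestCaseOpts)

// WithParallel marks the test case for parallel execution.
func WithParallel() Option {
	return func(o *TestCaseOpts) { o.Parallel = true }
}

// buildOpts applies each option to a zero-valued TestCaseOpts.
func buildOpts(options ...Option) TestCaseOpts {
	var o TestCaseOpts
	for _, opt := range options {
		opt(&o)
	}
	return o
}

func main() {
	// A future non-parallel option no longer flips the parallel flag:
	// parallelism is read from the field, not from len(options) > 0.
	o := buildOpts(WithParallel())
	fmt.Println(o.Parallel) // true
}
```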

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` at line 576, The current parallel detection
uses a fragile heuristic "parallel := len(opts) > 0" which treats any option as
meaning parallel; update detection to explicitly check for the parallel option
(e.g., search opts for a concrete Parallel/RunInParallel value or add and use a
bool field on TestCaseOpts such as TestCaseOpts.Parallel) and use that explicit
flag instead of len(opts) > 0 (or, if you prefer a minimal change, add a clear
comment above the "parallel" declaration documenting the assumption); refer to
the local variable "parallel", the "opts" slice, and the TestCaseOpts type when
making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 407b6dde-ebe6-45a0-99dc-ad5b226417dc

📥 Commits

Reviewing files that changed from the base of the PR and between d02de1a and e8dc0fd.

📒 Files selected for processing (7)
  • pkg/clusterhealth/nodes.go
  • pkg/clusterhealth/operator.go
  • tests/e2e/circuit_breaker_test.go
  • tests/e2e/cluster_health_test.go
  • tests/e2e/controller_test.go
  • tests/e2e/failure_classifier_test.go
  • tests/e2e/helper_test.go


den-rgb commented Mar 5, 2026

/test opendatahub-operator-rhoai-e2e

1 similar comment

den-rgb commented Mar 5, 2026

/test opendatahub-operator-rhoai-e2e


den-rgb commented Mar 5, 2026

/retest

1 similar comment

den-rgb commented Mar 6, 2026

/retest

@StevenTobin

@den-rgb needs rebase to bring in #3216

@den-rgb den-rgb force-pushed the RHOAIENG-49727 branch 2 times, most recently from e87b7dd to 9a5b339 Compare March 9, 2026 15:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/controller_test.go`:
- Around line 309-314: The current t.Cleanup closure calls t.Fatalf after
handleCleanup's later-registered cleanup has already run (LIFO), so trips bypass
cleanup; move the circuit-breaker failure detection out of this late cleanup and
run it earlier (register this check with t.Cleanup or invoke it directly before
handleCleanup is registered) so it sets t.Fail/t.Errorf before handleCleanup
examines t.Failed(), or alternatively modify handleCleanup to treat
circuitBreaker.TotalTrips() > 0 (using circuitBreaker.TotalTrips() and
circuitBreaker.TripReason()) as a failure condition so cleanup paths under
DeletionPolicyOnFailure/DeletionPolicyAlways run correctly.
- Around line 316-321: The pre-flight failure handling in TestOdhOperator calls
circuitBreaker.ForceTrip(...) but then continues executing (reaching
CleanupPreviousTestResources(t)); modify the flow so that after calling
circuitBreaker.ForceTrip(...) when preflight.Healthy is false you return
immediately from TestOdhOperator to avoid making further cluster calls; locate
the healthChecker.Check() call that produces preflight and the subsequent
circuitBreaker.ForceTrip(...) invocation and add an immediate return (or
equivalent short-circuit) right after ForceTrip so
CleanupPreviousTestResources(t) and later mustRun guards are not reached.
- Around line 576-599: The current logic relies on the outer `passed :=
t.Run(name, ...)` return which is unreliable for parallel subtests; move failure
handling into the subtest's closure by adding a defer inside the anonymous test
function that checks t.Failed() and calls HandleTestFailure(name) (and emits the
same t.Logf("Stopping: %s test failed.") and t.Fail() behavior), and then use
the final t.Failed() state (or compute result := !t.Failed()) to call
circuitBreaker.RecordResult rather than the outer `passed` value; update the
closure to run defer HandleGlobalPanic() and then a defer that if t.Failed()
invokes HandleTestFailure(name) so parallel test failures are captured
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8db397f4-223c-4506-a3db-805fe5123497

📥 Commits

Reviewing files that changed from the base of the PR and between e8dc0fd and e87b7dd.

📒 Files selected for processing (6)
  • pkg/clusterhealth/operator.go
  • tests/e2e/circuit_breaker_test.go
  • tests/e2e/cluster_health_test.go
  • tests/e2e/controller_test.go
  • tests/e2e/failure_classifier_test.go
  • tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/clusterhealth/operator.go
  • tests/e2e/cluster_health_test.go


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
tests/e2e/controller_test.go (3)

316-322: ⚠️ Potential issue | 🟠 Major

Missing return after preflight ForceTrip causes continued execution.

When preflight fails and ForceTrip is called, execution continues to CleanupPreviousTestResources(t) at line 326 and subsequent mustRun calls. This undermines the circuit breaker's purpose of halting early on infrastructure failures.

 		preflight := healthChecker.Check()
 		if !preflight.Healthy {
 			circuitBreaker.ForceTrip(fmt.Sprintf(
 				"Pre-flight health check failed: [%s]",
 				strings.Join(preflight.Issues, "; ")))
+			return
 		}
 	}


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 316 - 322, The preflight failure
branch calls circuitBreaker.ForceTrip(...) but does not stop further execution;
update the block handling healthChecker.Check() so that after calling
circuitBreaker.ForceTrip(...) you immediately return from the test function to
prevent calling CleanupPreviousTestResources(t) and subsequent mustRun calls.
Locate the healthChecker.Check() call and its !preflight.Healthy branch and add
an early return right after circuitBreaker.ForceTrip(...) to enforce the circuit
breaker behavior.

309-314: ⚠️ Potential issue | 🟠 Major

Cleanup LIFO ordering causes breaker trips to bypass resource cleanup.

The t.Cleanup registered here runs after handleCleanup's cleanup (LIFO). When the breaker trips, handleCleanup checks t.Failed() which is still false, skips cleanup, then this cleanup calls t.Fatalf. Resources under DeletionPolicyOnFailure won't be cleaned.

Either register this cleanup after handleCleanup, or modify handleCleanup to also check circuitBreaker.TotalTrips() > 0.


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 309 - 314, The cleanup runs in
LIFO so the existing t.Cleanup that calls circuitBreaker.LogSummary() and
t.Fatalf can run after handleCleanup and prevent DeletionPolicyOnFailure from
running; modify handleCleanup to consider circuit breaker trips as a failure
condition by checking circuitBreaker.TotalTrips() > 0 (or a Tripped() helper) in
addition to t.Failed(), and ensure handleCleanup triggers the same resource
deletion/cleanup path when trips > 0 so resources under DeletionPolicyOnFailure
are cleaned even if t.Failed() is false.

576-596: ⚠️ Potential issue | 🟠 Major

Parallel subtests bypass failure diagnostics due to early t.Run return.

When WithParallel() is used, t.Run returns immediately after t.Parallel() is called inside the subtest. The passed value at line 591 doesn't reflect failures occurring after return. HandleTestFailure(name) at line 593 executes before the parallel subtest completes.

Move HandleTestFailure into a defer inside the subtest closure to capture failures during parallel execution:

 	passed := t.Run(name, func(t *testing.T) {
 		testExecuted = true
+		defer func() {
+			wasSkipped = t.Skipped()
+			if parallel && t.Failed() {
+				HandleTestFailure(name)
+			}
+		}()
 		for _, opt := range opts {
 			opt(t)
 		}
 		// Set up panic handler for each test group
 		defer HandleGlobalPanic()
-		defer func() { wasSkipped = t.Skipped() }()
 		testFunc(t)
 	})

-	if !passed {
+	if !parallel && !passed {
 		// Run diagnostics on test failure
 		HandleTestFailure(name)


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 576 - 596, The current outer check
using the returned value `passed` is unreliable for parallel subtests because
`t.Run` can return early; instead, inside the `t.Run(name, func(t *testing.T) {
... })` closure add a deferred function that runs after the subtest finishes
and, if `t.Failed()` is true, calls `HandleTestFailure(name)` and logs/fails as
needed; keep existing defers (`HandleGlobalPanic`, the skip capture) and the
`testFunc(t)` call, but remove or disable the outer `if !passed {
HandleTestFailure(name) ... }` block so diagnostics run from the deferred check
inside the subtest closure to catch failures from `WithParallel()` cases.
🧹 Nitpick comments (2)
tests/e2e/circuit_breaker_test.go (1)

194-209: TOCTOU in LogSummary between TotalTrips() and TripReason() calls.

Separate lock acquisitions at lines 199 and 207 allow state to change between calls. For logging consistency, acquire lock once:

 func (cb *CircuitBreaker) LogSummary() {
 	if cb == nil {
 		return
 	}
-	trips := cb.TotalTrips()
+	cb.mu.Lock()
+	trips := cb.totalTrips
+	reason := cb.tripReason
+	threshold := cb.threshold
+	cb.mu.Unlock()
+
 	if trips == 0 {
-		log.Printf("Circuit breaker: no trips during this run (threshold=%d)", cb.threshold)
+		log.Printf("Circuit breaker: no trips during this run (threshold=%d)", threshold)
 		return
 	}

 	log.Printf("=== CIRCUIT BREAKER SUMMARY ===")
 	log.Printf("Tripped %d time(s)", trips)
-	log.Printf("Reason: %s", cb.TripReason())
+	log.Printf("Reason: %s", reason)
 	log.Printf("===============================")
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/circuit_breaker_test.go` around lines 194 - 209, LogSummary
performs separate state reads via TotalTrips() and TripReason(), allowing TOCTOU
changes between calls; modify CircuitBreaker.LogSummary to acquire the circuit
breaker's mutex once around reading both TotalTrips and TripReason (or directly
read the underlying counters/reason fields while holding cb.mu) so the logged
trips count and reason reflect the same consistent snapshot; use the same mutex
name used in the type (e.g., cb.mu) and call the accessor methods or read fields
under that single lock before emitting the log lines.
tests/e2e/cluster_health_test.go (1)

107-114: Issue collection hardcodes section names; consider aligning with OnlySections.

If OnlySections is extended to include additional sections (e.g., SectionPods), this loop won't capture their errors. Consider iterating based on the report structure or extracting section names to a shared constant.

-	for _, s := range []struct{ name, err string }{
-		{"nodes", report.Nodes.Error},
-		{"operator", report.Operator.Error},
-	} {
-		if s.err != "" {
-			status.Issues = append(status.Issues, s.err)
-		}
-	}
+	if report.Nodes.Error != "" {
+		status.Issues = append(status.Issues, report.Nodes.Error)
+	}
+	if report.Operator.Error != "" {
+		status.Issues = append(status.Issues, report.Operator.Error)
+	}

The simpler form removes the unused name field and makes the sections explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/cluster_health_test.go` around lines 107 - 114, The loop that
builds status.Issues currently hardcodes sections via a local struct with unused
name and will miss any new sections (e.g., SectionPods) added to OnlySections;
update the logic to iterate over the canonical list of report sections (use
OnlySections or a shared slice/map of section keys) and for each section read
the corresponding report.<Section>.Error (e.g., report.Nodes.Error,
report.Operator.Error, report.Pods.Error) or use a report.Errors map if
available, appending non-empty errors to status.Issues; remove the unused name
field and ensure the source of section names is the single shared
constant/variable so adding sections automatically includes their errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/e2e/controller_test.go`:
- Around line 316-322: The preflight failure branch calls
circuitBreaker.ForceTrip(...) but does not stop further execution; update the
block handling healthChecker.Check() so that after calling
circuitBreaker.ForceTrip(...) you immediately return from the test function to
prevent calling CleanupPreviousTestResources(t) and subsequent mustRun calls.
Locate the healthChecker.Check() call and its !preflight.Healthy branch and add
an early return right after circuitBreaker.ForceTrip(...) to enforce the circuit
breaker behavior.
- Around line 309-314: The cleanup runs in LIFO so the existing t.Cleanup that
calls circuitBreaker.LogSummary() and t.Fatalf can run after handleCleanup and
prevent DeletionPolicyOnFailure from running; modify handleCleanup to consider
circuit breaker trips as a failure condition by checking
circuitBreaker.TotalTrips() > 0 (or a Tripped() helper) in addition to
t.Failed(), and ensure handleCleanup triggers the same resource deletion/cleanup
path when trips > 0 so resources under DeletionPolicyOnFailure are cleaned even
if t.Failed() is false.
- Around line 576-596: The current outer check using the returned value `passed`
is unreliable for parallel subtests because `t.Run` can return early; instead,
inside the `t.Run(name, func(t *testing.T) { ... })` closure add a deferred
function that runs after the subtest finishes and, if `t.Failed()` is true,
calls `HandleTestFailure(name)` and logs/fails as needed; keep existing defers
(`HandleGlobalPanic`, the skip capture) and the `testFunc(t)` call, but remove
or disable the outer `if !passed { HandleTestFailure(name) ... }` block so
diagnostics run from the deferred check inside the subtest closure to catch
failures from `WithParallel()` cases.
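The deferred-check pattern from the last item can be sketched with a minimal stand-in for `*testing.T` (`fakeT`, `runSubtest`, and the returned string are all hypothetical; in the real suite this is the subtest's `t` inside the `t.Run` closure). The point is that a `defer` inside the closure observes the subtest's final failure state, which the outer `passed` return value cannot do for parallel subtests, because `t.Run` returns before a parallel subtest finishes.

```go
package main

import "fmt"

// fakeT is a minimal stand-in for *testing.T, exposing only Failed().
type fakeT struct{ failed bool }

func (t *fakeT) Failed() bool { return t.failed }

// runSubtest mirrors the suggested control flow: the failure check runs in a
// defer inside the closure, so it fires after the subtest body completes.
// It returns the diagnostic action taken (empty if the subtest passed).
func runSubtest(name string, body func(t *fakeT)) (failureHandled string) {
	t := &fakeT{}
	defer func() {
		// Runs after body(t), seeing the subtest's final state even in
		// cases where an outer `passed := t.Run(...)` would have
		// returned early for a parallel subtest.
		if t.Failed() {
			failureHandled = "HandleTestFailure(" + name + ")"
		}
	}()
	body(t)
	return
}

func main() {
	fmt.Println(runSubtest("broken-case", func(t *fakeT) { t.failed = true }))
	// → HandleTestFailure(broken-case)
}
```

With the check inside the closure, the outer `if !passed { ... }` block becomes redundant and can be removed, as the review suggests.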

---

Nitpick comments:
In `@tests/e2e/circuit_breaker_test.go`:
- Around line 194-209: LogSummary performs separate state reads via TotalTrips()
and TripReason(), allowing TOCTOU changes between calls; modify
CircuitBreaker.LogSummary to acquire the circuit breaker's mutex once around
reading both TotalTrips and TripReason (or directly read the underlying
counters/reason fields while holding cb.mu) so the logged trips count and reason
reflect the same consistent snapshot; use the same mutex name used in the type
(e.g., cb.mu) and call the accessor methods or read fields under that single
lock before emitting the log lines.
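A sketch of the single-lock snapshot the nitpick asks for, using a hypothetical minimal `CircuitBreaker` (the real type lives in `tests/e2e/circuit_breaker_test.go`): both fields are read while holding `cb.mu` once, so the logged count and reason always belong to the same state.

```go
package main

import (
	"fmt"
	"sync"
)

// CircuitBreaker is a simplified stand-in; only the fields LogSummary
// needs are shown.
type CircuitBreaker struct {
	mu         sync.Mutex
	totalTrips int
	tripReason string
}

// LogSummary reads both fields under one lock so the trips count and the
// reason come from the same consistent snapshot, closing the TOCTOU window
// that separate TotalTrips() and TripReason() calls would leave open.
func (cb *CircuitBreaker) LogSummary() string {
	cb.mu.Lock()
	trips, reason := cb.totalTrips, cb.tripReason
	cb.mu.Unlock()
	return fmt.Sprintf("circuit breaker: trips=%d reason=%q", trips, reason)
}

func main() {
	cb := &CircuitBreaker{totalTrips: 1, tripReason: "cluster unhealthy"}
	fmt.Println(cb.LogSummary())
	// → circuit breaker: trips=1 reason="cluster unhealthy"
}
```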

In `@tests/e2e/cluster_health_test.go`:
- Around line 107-114: The loop that builds status.Issues currently hardcodes
sections via a local struct with unused name and will miss any new sections
(e.g., SectionPods) added to OnlySections; update the logic to iterate over the
canonical list of report sections (use OnlySections or a shared slice/map of
section keys) and for each section read the corresponding report.<Section>.Error
(e.g., report.Nodes.Error, report.Operator.Error, report.Pods.Error) or use a
report.Errors map if available, appending non-empty errors to status.Issues;
remove the unused name field and ensure the source of section names is the
single shared constant/variable so adding sections automatically includes their
errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0cbd3de3-996d-436f-9b9c-9240684a9053

📥 Commits

Reviewing files that changed from the base of the PR and between e87b7dd and 9a5b339.

📒 Files selected for processing (5)
  • tests/e2e/circuit_breaker_test.go
  • tests/e2e/cluster_health_test.go
  • tests/e2e/controller_test.go
  • tests/e2e/failure_classifier_test.go
  • tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/failure_classifier_test.go

@den-rgb den-rgb force-pushed the RHOAIENG-49727 branch 3 times, most recently from 9842eda to f05e535 on March 10, 2026 11:10
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/controller_test.go`:
- Around line 599-600: Reorder the if-condition so it short-circuits on
!parallel before reading wasSkipped to avoid a data race; specifically update
the conditional that guards circuitBreaker.RecordResult(passed) (where variables
testExecuted, wasSkipped, parallel are used) to check !parallel first (e.g., if
testExecuted && !parallel && !wasSkipped) so the read of wasSkipped is skipped
when parallel is true.
- Around line 577-600: The code currently infers parallelization by checking
parallel := len(opts) > 0 which is unreliable for suites that internally call
RunTestCases(..., WithParallel()) (e.g., dscValidationTestSuite invoked via
mustRun), so replace this heuristic with an explicit "safe to record" signal:
add a new option/flag (e.g., WithSafeToRecord or SafeToRecord boolean) that can
be passed through opts and read here as safeToRecord (instead of parallel :=
len(opts) > 0), update callers of mustRun/RunTestCases and any suite functions
(like dscValidationTestSuite) to set that option when they guarantee no internal
parallel work remains at t.Run return, and finally use safeToRecord when
deciding whether to call circuitBreaker.RecordResult(passed); keep testExecuted
and wasSkipped logic unchanged.
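The explicit "safe to record" signal can be sketched with a hypothetical functional option (`RunOption`, `WithSafeToRecord`, and `shouldRecord` are illustrative names, not the repository's API): callers opt in when they guarantee no internal parallel work remains at `t.Run` return, instead of the harness guessing from `len(opts) > 0`.

```go
package main

import "fmt"

// runConfig carries per-suite flags; safeToRecord replaces the unreliable
// `parallel := len(opts) > 0` heuristic.
type runConfig struct{ safeToRecord bool }

// RunOption is a functional option applied to runConfig.
type RunOption func(*runConfig)

// WithSafeToRecord is set by callers that guarantee the suite has no
// internal parallel subtests still running when t.Run returns.
func WithSafeToRecord() RunOption {
	return func(c *runConfig) { c.safeToRecord = true }
}

// shouldRecord decides whether the circuit breaker may record this result;
// testExecuted and wasSkipped keep their existing meaning.
func shouldRecord(testExecuted, wasSkipped bool, opts ...RunOption) bool {
	cfg := &runConfig{}
	for _, o := range opts {
		o(cfg)
	}
	return testExecuted && cfg.safeToRecord && !wasSkipped
}

func main() {
	fmt.Println(shouldRecord(true, false, WithSafeToRecord())) // caller opted in
	fmt.Println(shouldRecord(true, false))                     // no guarantee: skip recording
	// → true
	// → false
}
```

Because recording is now opt-in, a suite that forgets the option merely skips the counter rather than risking a false reset from a racy read.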

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 453e0638-6fad-4fbb-8756-a4f287078e8f

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5b339 and f05e535.

📒 Files selected for processing (5)
  • tests/e2e/circuit_breaker_test.go
  • tests/e2e/cluster_health_test.go
  • tests/e2e/controller_test.go
  • tests/e2e/failure_classifier_test.go
  • tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/failure_classifier_test.go

Comment on lines +599 to +600
if testExecuted && !wasSkipped && !parallel {
circuitBreaker.RecordResult(passed)
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and check its contents around the mentioned lines
fd controller_test.go | head -5

Repository: opendatahub-io/opendatahub-operator

Length of output: 435


🏁 Script executed:

# Once found, let's read the file to understand the context
wc -l tests/e2e/controller_test.go

Repository: opendatahub-io/opendatahub-operator

Length of output: 111


🏁 Script executed:

# Read the relevant section with context
sed -n '580,620p' tests/e2e/controller_test.go

Repository: opendatahub-io/opendatahub-operator

Length of output: 695


🏁 Script executed:

# Get more context to see the function structure and variable declarations
sed -n '550,609p' tests/e2e/controller_test.go

Repository: opendatahub-io/opendatahub-operator

Length of output: 1455


Reorder condition to short-circuit on !parallel before reading wasSkipped.

When a subtest calls t.Parallel(), t.Run() returns immediately before the test's deferred cleanup runs. Evaluating !wasSkipped while the test goroutine may be setting it via defer func() { wasSkipped = t.Skipped() }() creates a data race. Short-circuiting on !parallel prevents this read when parallel is true.

Suggested fix
-if testExecuted && !wasSkipped && !parallel {
+if testExecuted && !parallel && !wasSkipped {
 	circuitBreaker.RecordResult(passed)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 599 - 600, Reorder the
if-condition so it short-circuits on !parallel before reading wasSkipped to
avoid a data race; specifically update the conditional that guards
circuitBreaker.RecordResult(passed) (where variables testExecuted, wasSkipped,
parallel are used) to check !parallel first (e.g., if testExecuted && !parallel
&& !wasSkipped) so the read of wasSkipped is skipped when parallel is true.

@den-rgb
Contributor Author

den-rgb commented Mar 10, 2026

/retest

@StevenTobin
Contributor

The classification piece you added here will duplicate the intent of #3222, right?

@den-rgb
Contributor Author

den-rgb commented Mar 16, 2026

/test opendatahub-operator-rhoai-e2e

1 similar comment
@den-rgb
Contributor Author

den-rgb commented Mar 16, 2026

/test opendatahub-operator-rhoai-e2e

@openshift-ci

openshift-ci bot commented Mar 16, 2026

@den-rgb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/opendatahub-operator-rhoai-e2e
Commit: 24b8398
Required: true
Rerun command: /test opendatahub-operator-rhoai-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
