RHOAIENG-49727 | feat: Adding e2e test circuit breaker #3215
den-rgb wants to merge 1 commit into opendatahub-io:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Adds an end-to-end circuit breaker and related test utilities to the e2e test suite. New components: a concurrency-safe CircuitBreaker (state, threshold, trip reasons, logging, skip helpers), a ClusterHealthChecker that wraps clusterhealth.Run and returns ClusterHealthStatus, and a FailureCategory classifier that scores failure output against infra/test-logic patterns. Integrates the breaker into the test runner with new flags (circuit-breaker, circuit-breaker-threshold), records test outcomes, skips tests when the breaker is open, and exposes diagnostics and force-trip APIs.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Critical Issues
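The concurrency-safe CircuitBreaker described in the walkthrough can be sketched minimally as below. Every name, field, and method here is an illustrative assumption for the pattern (mutex-guarded state, threshold, trip reason, force-trip), not the PR's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// CircuitBreaker is a hypothetical minimal sketch of the component described
// above; field and method names are assumptions, not the PR's real code.
type CircuitBreaker struct {
	mu          sync.Mutex
	threshold   int
	consecutive int
	open        bool
	tripReason  string
	totalTrips  int
}

func NewCircuitBreaker(threshold int) *CircuitBreaker {
	return &CircuitBreaker{threshold: threshold}
}

// RecordResult counts consecutive failures and trips once the threshold is hit;
// a pass resets the counter.
func (cb *CircuitBreaker) RecordResult(passed bool) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if passed {
		cb.consecutive = 0
		return
	}
	cb.consecutive++
	if cb.consecutive >= cb.threshold && !cb.open {
		cb.open = true
		cb.totalTrips++
		cb.tripReason = fmt.Sprintf("%d consecutive failures", cb.consecutive)
	}
}

// IsOpen reports whether remaining tests should be skipped.
func (cb *CircuitBreaker) IsOpen() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	return cb.open
}

// ForceTrip opens the breaker immediately, e.g. after a failed pre-flight check.
func (cb *CircuitBreaker) ForceTrip(reason string) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.open = true
	cb.totalTrips++
	cb.tripReason = reason
}

func main() {
	cb := NewCircuitBreaker(3)
	for i := 0; i < 3; i++ {
		cb.RecordResult(false)
	}
	fmt.Println(cb.IsOpen()) // three consecutive failures reach the threshold
}
```

The mutex keeps the counter and open flag consistent when parallel subtests record results concurrently.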
🚥 Pre-merge checks | ✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Force-pushed 9197106 to 8b48984 (Compare)
Force-pushed e6dca1e to 4794521 (Compare)
Actionable comments posted: 4
🧹 Nitpick comments (2)
pkg/clusterhealth/operator.go (1)
94-97: Prefer using the deployment's selector over hardcoded label pairs. Category: Suggestions. This avoids selector drift across operator variants and keeps pod discovery aligned with the actual deployment spec.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/clusterhealth/operator.go` around lines 94 - 97, Replace the hardcoded labelSelectors slice (labelSelectors := []map[string]string{...}) with logic that retrieves the target Deployment's selector and uses that selector to discover pods; specifically locate where labelSelectors is defined in operator.go and instead fetch the Deployment resource (e.g., the controller-manager or rhods-operator Deployment) and use its .Spec.Selector.MatchLabels (or converted metav1.LabelSelector) as the label map for pod listing so discovery follows the actual Deployment.Spec.Selector rather than fixed label pairs.

tests/e2e/controller_test.go (1)
576-577: Make `parallel` explicit instead of deriving it from `len(opts)`. Category: Suggestions.
`len(opts) > 0` will misclassify future non-parallel options and can suppress result recording unintentionally.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/controller_test.go` around lines 576 - 577, The current code infers parallelism with "parallel := len(opts) > 0", which is fragile; change to use an explicit boolean flag instead (e.g., add or read a field like opts.Parallel or accept an explicit "parallel" parameter) so parallel is not tied to presence of other options; update the variable assignment where "parallel" is declared to read that explicit flag and adjust any call sites that construct "opts" to set the flag, keeping "testExecuted" behavior unchanged.
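The explicit-flag fix the prompt asks for can be sketched with a functional-options pattern. `testCaseConfig`, `WithParallel`, and `WithRetries` are hypothetical stand-ins for the PR's `TestCaseOpts`; the point is that an unrelated option no longer implies parallelism:

```go
package main

import "fmt"

// testCaseConfig is a hypothetical options struct; the real TestCaseOpts in the
// PR may differ. The key idea: an explicit Parallel flag, not len(opts) > 0.
type testCaseConfig struct {
	Parallel bool
}

type TestCaseOpt func(*testCaseConfig)

// WithParallel explicitly marks a test case as parallel.
func WithParallel() TestCaseOpt {
	return func(c *testCaseConfig) { c.Parallel = true }
}

// WithRetries is a second, non-parallel option showing why len(opts) > 0
// misclassifies: its presence alone says nothing about parallelism.
func WithRetries(n int) TestCaseOpt {
	return func(c *testCaseConfig) { _ = n }
}

// resolveParallel applies all options and reads the explicit flag.
func resolveParallel(opts ...TestCaseOpt) bool {
	cfg := &testCaseConfig{}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg.Parallel
}

func main() {
	fmt.Println(resolveParallel(WithRetries(2))) // false: option present, not parallel
	fmt.Println(resolveParallel(WithParallel())) // true: explicit flag
}
```

With this shape, adding future options never silently flips the result-recording logic.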
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/clusterhealth/nodes.go`:
- Around line 46-53: The code currently overwrites info.UnhealthyReason for each
failing condition; update the logic in the node condition processing (the
branches that build "reason" using node.Name, cond.Type and cond.Message) to
preserve all reasons by appending new reasons instead of replacing them—e.g., if
info.UnhealthyReason is empty set it to the first reason, otherwise concatenate
the new reason with a clear separator (comma, semicolon, or newline). Make this
change in both the NodeReady branch and the problemConditions branch so
info.UnhealthyReason accumulates every reason while keeping the existing issues
= append(issues, reason) behavior.
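The append-instead-of-overwrite behavior asked for above can be sketched as follows; `nodeInfo` and `appendReason` are illustrative names, not the package's actual types:

```go
package main

import "fmt"

// nodeInfo is a stand-in for the PR's node info struct; only the
// UnhealthyReason accumulation pattern is shown here.
type nodeInfo struct {
	UnhealthyReason string
}

// appendReason preserves every reason: the first one is set directly,
// later ones are concatenated with a separator instead of replacing it.
func (n *nodeInfo) appendReason(reason string) {
	if n.UnhealthyReason == "" {
		n.UnhealthyReason = reason
	} else {
		n.UnhealthyReason += "; " + reason
	}
}

func main() {
	info := &nodeInfo{}
	info.appendReason("node worker-0: Ready=False")
	info.appendReason("node worker-0: MemoryPressure=True")
	fmt.Println(info.UnhealthyReason)
}
```

Calling this from both the NodeReady branch and the problemConditions branch would keep the existing `issues = append(issues, reason)` behavior while retaining all reasons.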
In `@pkg/clusterhealth/operator.go`:
- Around line 102-108: The loop over labelSelectors currently swallows errors
from c.List (in operator.go) which can hide pod/container failures; update the
logic in the loop that calls c.List(ctx, pods, client.InNamespace(namespace),
client.MatchingLabels(labels)) to surface failures instead of continuing
silently—either aggregate the error and mark the operator/health check as
degraded, log the error with context (including the selector and namespace), or
return the error up to the caller so the operator health reflects the failure;
modify the pod-list handling around c.List, pods (*corev1.PodList) and the
surrounding health reporting code to ensure any c.List error becomes a visible
health issue rather than being ignored.
In `@tests/e2e/circuit_breaker_test.go`:
- Around line 104-105: The test calls cb.healthChecker.Check() without verifying
healthChecker is non-nil; guard against a nil checker in the test by checking
cb.healthChecker != nil before invoking Check() (or by initializing
cb.healthChecker with a mock before the call). Update the test around the call
to cb.healthChecker.Check() to either assert cb.healthChecker is not nil and
then call Check(), or set cb.healthChecker to a valid HealthChecker
implementation so cb.healthChecker.Check() cannot panic.
In `@tests/e2e/cluster_health_test.go`:
- Around line 92-96: The test currently calls clusterhealth.Run with
context.TODO(), which can hang; change it to use a bounded context created via
context.WithTimeout (or WithDeadline) and pass that ctx into clusterhealth.Run,
e.g., create ctx, cancel := context.WithTimeout(context.Background(),
<reasonable timeout like 10s>); defer cancel(); then call clusterhealth.Run(ctx,
cfg). Ensure the cancel is deferred so the timer is cleaned up and keep the
existing error handling (setting status.Healthy=false and appending the issue)
unchanged.
---
Nitpick comments:
In `@pkg/clusterhealth/operator.go`:
- Around line 94-97: Replace the hardcoded labelSelectors slice (labelSelectors
:= []map[string]string{...}) with logic that retrieves the target Deployment's
selector and uses that selector to discover pods; specifically locate where
labelSelectors is defined in operator.go and instead fetch the Deployment
resource (e.g., the controller-manager or rhods-operator Deployment) and use its
.Spec.Selector.MatchLabels (or converted metav1.LabelSelector) as the label map
for pod listing so discovery follows the actual Deployment.Spec.Selector rather
than fixed label pairs.
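The drift-free discovery idea can be reduced to a tiny sketch. `deployment` mimics only the slice of `appsv1.Deployment` this comment cares about (`Spec.Selector.MatchLabels`); in real code the struct would come from fetching the Deployment with the controller-runtime client:

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for appsv1.Deployment, keeping only
// the selector labels relevant to pod discovery.
type deployment struct {
	SelectorMatchLabels map[string]string
}

// podSelectorFor derives the pod-listing labels from the deployment itself,
// so discovery cannot drift from the deployment spec.
func podSelectorFor(d deployment) map[string]string {
	return d.SelectorMatchLabels
}

func main() {
	d := deployment{SelectorMatchLabels: map[string]string{
		"control-plane": "controller-manager",
	}}
	fmt.Println(podSelectorFor(d)["control-plane"])
}
```

Whatever labels the operator's Deployment declares, the pod listing follows automatically, instead of maintaining a parallel hardcoded slice per variant.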
In `@tests/e2e/controller_test.go`:
- Around line 576-577: The current code infers parallelism with "parallel :=
len(opts) > 0", which is fragile; change to use an explicit boolean flag instead
(e.g., add or read a field like opts.Parallel or accept an explicit "parallel"
parameter) so parallel is not tied to presence of other options; update the
variable assignment where "parallel" is declared to read that explicit flag and
adjust any call sites that construct "opts" to set the flag, keeping
"testExecuted" behavior unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f6ac9cc2-804d-4fbf-8da0-fe3f46bba367
📒 Files selected for processing (7)
pkg/clusterhealth/nodes.go, pkg/clusterhealth/operator.go, tests/e2e/circuit_breaker_test.go, tests/e2e/cluster_health_test.go, tests/e2e/controller_test.go, tests/e2e/failure_classifier_test.go, tests/e2e/helper_test.go
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #3215 +/- ##
=======================================
Coverage 49.65% 49.65%
=======================================
Files 165 165
Lines 12314 12314
=======================================
Hits 6115 6115
Misses 5551 5551
Partials 648 648

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
tests/e2e/controller_test.go (1)
576-576: Category: Code Quality — Fragile parallel detection assumes any option implies parallel.
`parallel := len(opts) > 0` assumes all options mean parallel execution. If a non-parallel option is added in the future, this will incorrectly skip recording results. Consider explicitly checking for the parallel option:
Suggested improvement
```diff
- parallel := len(opts) > 0
+ parallel := false
+ for _, opt := range opts {
+     // Check if this is the parallel option by comparing function pointers
+     // or use a dedicated flag in TestCaseOpts
+ }
```
Alternatively, add a comment clarifying the assumption, or refactor `TestCaseOpts` to include a flag indicating parallel execution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/controller_test.go` at line 576, The current parallel detection uses a fragile heuristic "parallel := len(opts) > 0" which treats any option as meaning parallel; update detection to explicitly check for the parallel option (e.g., search opts for a concrete Parallel/RunInParallel value or add and use a bool field on TestCaseOpts such as TestCaseOpts.Parallel) and use that explicit flag instead of len(opts) > 0 (or, if you prefer a minimal change, add a clear comment above the "parallel" declaration documenting the assumption); refer to the local variable "parallel", the "opts" slice, and the TestCaseOpts type when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/e2e/controller_test.go`:
- Line 576: The current parallel detection uses a fragile heuristic "parallel :=
len(opts) > 0" which treats any option as meaning parallel; update detection to
explicitly check for the parallel option (e.g., search opts for a concrete
Parallel/RunInParallel value or add and use a bool field on TestCaseOpts such as
TestCaseOpts.Parallel) and use that explicit flag instead of len(opts) > 0 (or,
if you prefer a minimal change, add a clear comment above the "parallel"
declaration documenting the assumption); refer to the local variable "parallel",
the "opts" slice, and the TestCaseOpts type when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 407b6dde-ebe6-45a0-99dc-ad5b226417dc
📒 Files selected for processing (7)
pkg/clusterhealth/nodes.go, pkg/clusterhealth/operator.go, tests/e2e/circuit_breaker_test.go, tests/e2e/cluster_health_test.go, tests/e2e/controller_test.go, tests/e2e/failure_classifier_test.go, tests/e2e/helper_test.go
/test opendatahub-operator-rhoai-e2e
1 similar comment
/test opendatahub-operator-rhoai-e2e
/retest
1 similar comment
/retest
Force-pushed e87b7dd to 9a5b339 (Compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/controller_test.go`:
- Around line 309-314: The current t.Cleanup closure calls t.Fatalf after
handleCleanup's later-registered cleanup has already run (LIFO), so trips bypass
cleanup; move the circuit-breaker failure detection out of this late cleanup and
run it earlier (register this check with t.Cleanup or invoke it directly before
handleCleanup is registered) so it sets t.Fail/t.Errorf before handleCleanup
examines t.Failed(), or alternatively modify handleCleanup to treat
circuitBreaker.TotalTrips() > 0 (using circuitBreaker.TotalTrips() and
circuitBreaker.TripReason()) as a failure condition so cleanup paths under
DeletionPolicyOnFailure/DeletionPolicyAlways run correctly.
- Around line 316-321: The pre-flight failure handling in TestOdhOperator calls
circuitBreaker.ForceTrip(...) but then continues executing (reaching
CleanupPreviousTestResources(t)); modify the flow so that after calling
circuitBreaker.ForceTrip(...) when preflight.Healthy is false you return
immediately from TestOdhOperator to avoid making further cluster calls; locate
the healthChecker.Check() call that produces preflight and the subsequent
circuitBreaker.ForceTrip(...) invocation and add an immediate return (or
equivalent short-circuit) right after ForceTrip so
CleanupPreviousTestResources(t) and later mustRun guards are not reached.
- Around line 576-599: The current logic relies on the outer `passed :=
t.Run(name, ...)` return which is unreliable for parallel subtests; move failure
handling into the subtest's closure by adding a defer inside the anonymous test
function that checks t.Failed() and calls HandleTestFailure(name) (and emits the
same t.Logf("Stopping: %s test failed.") and t.Fail() behavior), and then use
the final t.Failed() state (or compute result := !t.Failed()) to call
circuitBreaker.RecordResult rather than the outer `passed` value; update the
closure to run defer HandleGlobalPanic() and then a defer that if t.Failed()
invokes HandleTestFailure(name) so parallel test failures are captured
correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 8db397f4-223c-4506-a3db-805fe5123497
📒 Files selected for processing (6)
pkg/clusterhealth/operator.go, tests/e2e/circuit_breaker_test.go, tests/e2e/cluster_health_test.go, tests/e2e/controller_test.go, tests/e2e/failure_classifier_test.go, tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/clusterhealth/operator.go
- tests/e2e/cluster_health_test.go
♻️ Duplicate comments (3)
tests/e2e/controller_test.go (3)
316-322: ⚠️ Potential issue | 🟠 Major — Missing return after preflight ForceTrip causes continued execution.
When preflight fails and `ForceTrip` is called, execution continues to `CleanupPreviousTestResources(t)` at line 326 and subsequent `mustRun` calls. This undermines the circuit breaker's purpose of halting early on infrastructure failures.

```diff
 preflight := healthChecker.Check()
 if !preflight.Healthy {
 	circuitBreaker.ForceTrip(fmt.Sprintf(
 		"Pre-flight health check failed: [%s]",
 		strings.Join(preflight.Issues, "; ")))
+	return
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/controller_test.go` around lines 316 - 322, The preflight failure branch calls circuitBreaker.ForceTrip(...) but does not stop further execution; update the block handling healthChecker.Check() so that after calling circuitBreaker.ForceTrip(...) you immediately return from the test function to prevent calling CleanupPreviousTestResources(t) and subsequent mustRun calls. Locate the healthChecker.Check() call and its !preflight.Healthy branch and add an early return right after circuitBreaker.ForceTrip(...) to enforce the circuit breaker behavior.
309-314: ⚠️ Potential issue | 🟠 Major — Cleanup LIFO ordering causes breaker trips to bypass resource cleanup.
The `t.Cleanup` registered here runs after `handleCleanup`'s cleanup (LIFO). When the breaker trips, `handleCleanup` checks `t.Failed()`, which is still false, skips cleanup, then this cleanup calls `t.Fatalf`. Resources under `DeletionPolicyOnFailure` won't be cleaned. Either register this cleanup after `handleCleanup`, or modify `handleCleanup` to also check `circuitBreaker.TotalTrips() > 0`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/controller_test.go` around lines 309 - 314, The cleanup runs in LIFO so the existing t.Cleanup that calls circuitBreaker.LogSummary() and t.Fatalf can run after handleCleanup and prevent DeletionPolicyOnFailure from running; modify handleCleanup to consider circuit breaker trips as a failure condition by checking circuitBreaker.TotalTrips() > 0 (or a Tripped() helper) in addition to t.Failed(), and ensure handleCleanup triggers the same resource deletion/cleanup path when trips > 0 so resources under DeletionPolicyOnFailure are cleaned even if t.Failed() is false.
576-596: ⚠️ Potential issue | 🟠 Major — Parallel subtests bypass failure diagnostics due to early `t.Run` return.
When `WithParallel()` is used, `t.Run` returns immediately after `t.Parallel()` is called inside the subtest. The `passed` value at line 591 doesn't reflect failures occurring after return. `HandleTestFailure(name)` at line 593 executes before the parallel subtest completes. Move `HandleTestFailure` into a defer inside the subtest closure to capture failures during parallel execution:

```diff
 passed := t.Run(name, func(t *testing.T) {
 	testExecuted = true
+	defer func() {
+		wasSkipped = t.Skipped()
+		if parallel && t.Failed() {
+			HandleTestFailure(name)
+		}
+	}()
 	for _, opt := range opts {
 		opt(t)
 	}
 	// Set up panic handler for each test group
 	defer HandleGlobalPanic()
-	defer func() { wasSkipped = t.Skipped() }()
 	testFunc(t)
 })
-if !passed {
+if !parallel && !passed {
 	// Run diagnostics on test failure
 	HandleTestFailure(name)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/controller_test.go` around lines 576 - 596, The current outer check using the returned value `passed` is unreliable for parallel subtests because `t.Run` can return early; instead, inside the `t.Run(name, func(t *testing.T) { ... })` closure add a deferred function that runs after the subtest finishes and, if `t.Failed()` is true, calls `HandleTestFailure(name)` and logs/fails as needed; keep existing defers (`HandleGlobalPanic`, the skip capture) and the `testFunc(t)` call, but remove or disable the outer `if !passed { HandleTestFailure(name) ... }` block so diagnostics run from the deferred check inside the subtest closure to catch failures from `WithParallel()` cases.
🧹 Nitpick comments (2)
tests/e2e/circuit_breaker_test.go (1)
194-209: TOCTOU in `LogSummary` between `TotalTrips()` and `TripReason()` calls. Separate lock acquisitions at lines 199 and 207 allow state to change between calls. For logging consistency, acquire the lock once:
```diff
 func (cb *CircuitBreaker) LogSummary() {
 	if cb == nil {
 		return
 	}
-	trips := cb.TotalTrips()
+	cb.mu.Lock()
+	trips := cb.totalTrips
+	reason := cb.tripReason
+	threshold := cb.threshold
+	cb.mu.Unlock()
+
 	if trips == 0 {
-		log.Printf("Circuit breaker: no trips during this run (threshold=%d)", cb.threshold)
+		log.Printf("Circuit breaker: no trips during this run (threshold=%d)", threshold)
 		return
 	}
 	log.Printf("=== CIRCUIT BREAKER SUMMARY ===")
 	log.Printf("Tripped %d time(s)", trips)
-	log.Printf("Reason: %s", cb.TripReason())
+	log.Printf("Reason: %s", reason)
 	log.Printf("===============================")
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/circuit_breaker_test.go` around lines 194 - 209, LogSummary performs separate state reads via TotalTrips() and TripReason(), allowing TOCTOU changes between calls; modify CircuitBreaker.LogSummary to acquire the circuit breaker's mutex once around reading both TotalTrips and TripReason (or directly read the underlying counters/reason fields while holding cb.mu) so the logged trips count and reason reflect the same consistent snapshot; use the same mutex name used in the type (e.g., cb.mu) and call the accessor methods or read fields under that single lock before emitting the log lines.

tests/e2e/cluster_health_test.go (1)
107-114: Issue collection hardcodes section names; consider aligning with `OnlySections`.
If `OnlySections` is extended to include additional sections (e.g., `SectionPods`), this loop won't capture their errors. Consider iterating based on the report structure or extracting section names to a shared constant.

```diff
-for _, s := range []struct{ name, err string }{
-	{"nodes", report.Nodes.Error},
-	{"operator", report.Operator.Error},
-} {
-	if s.err != "" {
-		status.Issues = append(status.Issues, s.err)
-	}
-}
+if report.Nodes.Error != "" {
+	status.Issues = append(status.Issues, report.Nodes.Error)
+}
+if report.Operator.Error != "" {
+	status.Issues = append(status.Issues, report.Operator.Error)
+}
```

The simpler form removes the unused `name` field and makes the sections explicit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/cluster_health_test.go` around lines 107 - 114, The loop that builds status.Issues currently hardcodes sections via a local struct with unused name and will miss any new sections (e.g., SectionPods) added to OnlySections; update the logic to iterate over the canonical list of report sections (use OnlySections or a shared slice/map of section keys) and for each section read the corresponding report.<Section>.Error (e.g., report.Nodes.Error, report.Operator.Error, report.Pods.Error) or use a report.Errors map if available, appending non-empty errors to status.Issues; remove the unused name field and ensure the source of section names is the single shared constant/variable so adding sections automatically includes their errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/e2e/controller_test.go`:
- Around line 316-322: The preflight failure branch calls
circuitBreaker.ForceTrip(...) but does not stop further execution; update the
block handling healthChecker.Check() so that after calling
circuitBreaker.ForceTrip(...) you immediately return from the test function to
prevent calling CleanupPreviousTestResources(t) and subsequent mustRun calls.
Locate the healthChecker.Check() call and its !preflight.Healthy branch and add
an early return right after circuitBreaker.ForceTrip(...) to enforce the circuit
breaker behavior.
- Around line 309-314: The cleanup runs in LIFO so the existing t.Cleanup that
calls circuitBreaker.LogSummary() and t.Fatalf can run after handleCleanup and
prevent DeletionPolicyOnFailure from running; modify handleCleanup to consider
circuit breaker trips as a failure condition by checking
circuitBreaker.TotalTrips() > 0 (or a Tripped() helper) in addition to
t.Failed(), and ensure handleCleanup triggers the same resource deletion/cleanup
path when trips > 0 so resources under DeletionPolicyOnFailure are cleaned even
if t.Failed() is false.
- Around line 576-596: The current outer check using the returned value `passed`
is unreliable for parallel subtests because `t.Run` can return early; instead,
inside the `t.Run(name, func(t *testing.T) { ... })` closure add a deferred
function that runs after the subtest finishes and, if `t.Failed()` is true,
calls `HandleTestFailure(name)` and logs/fails as needed; keep existing defers
(`HandleGlobalPanic`, the skip capture) and the `testFunc(t)` call, but remove
or disable the outer `if !passed { HandleTestFailure(name) ... }` block so
diagnostics run from the deferred check inside the subtest closure to catch
failures from `WithParallel()` cases.
---
Nitpick comments:
In `@tests/e2e/circuit_breaker_test.go`:
- Around line 194-209: LogSummary performs separate state reads via TotalTrips()
and TripReason(), allowing TOCTOU changes between calls; modify
CircuitBreaker.LogSummary to acquire the circuit breaker's mutex once around
reading both TotalTrips and TripReason (or directly read the underlying
counters/reason fields while holding cb.mu) so the logged trips count and reason
reflect the same consistent snapshot; use the same mutex name used in the type
(e.g., cb.mu) and call the accessor methods or read fields under that single
lock before emitting the log lines.
In `@tests/e2e/cluster_health_test.go`:
- Around line 107-114: The loop that builds status.Issues currently hardcodes
sections via a local struct with unused name and will miss any new sections
(e.g., SectionPods) added to OnlySections; update the logic to iterate over the
canonical list of report sections (use OnlySections or a shared slice/map of
section keys) and for each section read the corresponding report.<Section>.Error
(e.g., report.Nodes.Error, report.Operator.Error, report.Pods.Error) or use a
report.Errors map if available, appending non-empty errors to status.Issues;
remove the unused name field and ensure the source of section names is the
single shared constant/variable so adding sections automatically includes their
errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0cbd3de3-996d-436f-9b9c-9240684a9053
📒 Files selected for processing (5)
tests/e2e/circuit_breaker_test.go, tests/e2e/cluster_health_test.go, tests/e2e/controller_test.go, tests/e2e/failure_classifier_test.go, tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/e2e/failure_classifier_test.go
Force-pushed 9842eda to f05e535 (Compare)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/controller_test.go`:
- Around line 599-600: Reorder the if-condition so it short-circuits on
!parallel before reading wasSkipped to avoid a data race; specifically update
the conditional that guards circuitBreaker.RecordResult(passed) (where variables
testExecuted, wasSkipped, parallel are used) to check !parallel first (e.g., if
testExecuted && !parallel && !wasSkipped) so the read of wasSkipped is skipped
when parallel is true.
- Around line 577-600: The code currently infers parallelization by checking
parallel := len(opts) > 0 which is unreliable for suites that internally call
RunTestCases(..., WithParallel()) (e.g., dscValidationTestSuite invoked via
mustRun), so replace this heuristic with an explicit "safe to record" signal:
add a new option/flag (e.g., WithSafeToRecord or SafeToRecord boolean) that can
be passed through opts and read here as safeToRecord (instead of parallel :=
len(opts) > 0), update callers of mustRun/RunTestCases and any suite functions
(like dscValidationTestSuite) to set that option when they guarantee no internal
parallel work remains at t.Run return, and finally use safeToRecord when
deciding whether to call circuitBreaker.RecordResult(passed); keep testExecuted
and wasSkipped logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 453e0638-6fad-4fbb-8756-a4f287078e8f
📒 Files selected for processing (5)
tests/e2e/circuit_breaker_test.go, tests/e2e/cluster_health_test.go, tests/e2e/controller_test.go, tests/e2e/failure_classifier_test.go, tests/e2e/helper_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/e2e/failure_classifier_test.go
```go
if testExecuted && !wasSkipped && !parallel {
	circuitBreaker.RecordResult(passed)
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's find the file and check its contents around the mentioned lines
fd controller_test.go | head -5
```
Repository: opendatahub-io/opendatahub-operator

🏁 Script executed:
```shell
# Once found, let's read the file to understand the context
wc -l tests/e2e/controller_test.go
```

🏁 Script executed:
```shell
# Read the relevant section with context
sed -n '580,620p' tests/e2e/controller_test.go
```

🏁 Script executed:
```shell
# Get more context to see the function structure and variable declarations
sed -n '550,609p' tests/e2e/controller_test.go
```
Reorder condition to short-circuit on !parallel before reading wasSkipped.
When a subtest calls t.Parallel(), t.Run() returns immediately before the test's deferred cleanup runs. Evaluating !wasSkipped while the test goroutine may be setting it via defer func() { wasSkipped = t.Skipped() }() creates a data race. Short-circuiting on !parallel prevents this read when parallel is true.
Suggested fix
```diff
-if testExecuted && !wasSkipped && !parallel {
+if testExecuted && !parallel && !wasSkipped {
 	circuitBreaker.RecordResult(passed)
 }
```
📝 Committable suggestion
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if testExecuted && !parallel && !wasSkipped {
	circuitBreaker.RecordResult(passed)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/controller_test.go` around lines 599 - 600, Reorder the
if-condition so it short-circuits on !parallel before reading wasSkipped to
avoid a data race; specifically update the conditional that guards
circuitBreaker.RecordResult(passed) (where variables testExecuted, wasSkipped,
parallel are used) to check !parallel first (e.g., if testExecuted && !parallel
&& !wasSkipped) so the read of wasSkipped is skipped when parallel is true.
/retest

The classification piece you added here will duplicate the intent of #3222 right?

Made-with: Cursor

/test opendatahub-operator-rhoai-e2e
1 similar comment
/test opendatahub-operator-rhoai-e2e
@den-rgb: The following test failed, say
Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Description
Adds a circuit breaker that halts test execution early when infrastructure failures are detected, instead of letting every test timeout individually against a broken cluster.
The circuit breaker operates at three levels:
Pre-flight check - Before any tests run, TestOdhOperator runs a cluster health check. If the cluster is already unhealthy (operator down, nodes not ready), the breaker trips immediately and all suites are skipped. This catches the most common case: tests launched against a broken environment.
Sequential detection - Between test suites, mustRun records pass/fail results. After N consecutive failures (configurable, default 3), a health check runs. If the cluster is unhealthy, the breaker trips and remaining suites are skipped. If healthy, the counter resets (failures are test logic, not infrastructure). Skipped test groups and parallel tests don't record results to avoid false counter resets.
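The sequential-detection flow described above (count consecutive failures, run a health check at the threshold, trip if unhealthy, reset if healthy) can be sketched as below. `breaker` and `recordAndMaybeTrip` are illustrative names, not the PR's real API:

```go
package main

import "fmt"

// breaker holds the minimal state for sequential detection; in the real PR
// this state lives inside the concurrency-safe CircuitBreaker.
type breaker struct {
	threshold   int
	consecutive int
	open        bool
}

// recordAndMaybeTrip records one suite result. At the failure threshold a
// health check decides: an unhealthy cluster trips the breaker; a healthy one
// resets the counter, treating the failures as test logic, not infrastructure.
func (b *breaker) recordAndMaybeTrip(passed bool, clusterHealthy func() bool) {
	if passed {
		b.consecutive = 0
		return
	}
	b.consecutive++
	if b.consecutive < b.threshold {
		return
	}
	if clusterHealthy() {
		b.consecutive = 0 // failures look like test logic; keep running
		return
	}
	b.open = true // infrastructure is broken; skip remaining suites
}

func main() {
	b := &breaker{threshold: 3}
	unhealthy := func() bool { return false }
	for i := 0; i < 3; i++ {
		b.recordAndMaybeTrip(false, unhealthy)
	}
	fmt.Println(b.open)
}
```

This also shows why skipped groups and parallel tests must not record results: a spurious pass would reset `consecutive` and mask a genuinely broken cluster.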
How Has This Been Tested?
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement, use the `E2E update requirement opt-out justification` section below.

E2E update requirement opt-out justification
Summary by CodeRabbit