Skip to content

added github action to run the e2e tests in a kind cluster#3250

Open
CFSNM wants to merge 1 commit intoopendatahub-io:mainfrom
CFSNM:kind_e2e_gh_action
Open

added github action to run the e2e tests in a kind cluster#3250
CFSNM wants to merge 1 commit intoopendatahub-io:mainfrom
CFSNM:kind_e2e_gh_action

Conversation

@CFSNM
Copy link
Member

@CFSNM CFSNM commented Mar 11, 2026

Description

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work
  • The developer has run the integration test pipeline and verified that it passed successfully

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Please inspect the opt-out guidelines, to determine if the nature of the PR changes allows for skipping this requirement
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  1. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Summary by CodeRabbit

  • Chores

    • Added a CI workflow to run KinD-based end-to-end tests on PRs and via manual trigger.
  • Tests

    • Expanded E2E coverage with new suites for dependent-operator and DSC scenarios and added runtime flags to control their execution.
  • Bug Fixes

    • Improved test utility error handling and added safer handling for missing API resources to make E2E runs more robust.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new GitHub Actions workflow at .github/workflows/test-kind-odh-e2e.yaml to run KinD-based end-to-end tests on PRs to main and via manual dispatch. Updates tests under tests/e2e: adds two boolean fields to TestContextConfig (dependantOperatorsManagementTest, dscManagementTest), binds new command-line flags/env vars, conditionally runs two test suites (including a new dependantOperatorsManagementTestSuite that initializes test context and sequences DSC-related test cases). Adjusts test helper error propagation in pkg/utils/test/testf and adds NoMatch handling in tests/e2e/test_context_test.go.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a GitHub Actions workflow to run E2E tests in a KinD cluster, which aligns with the primary file addition.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-kind-odh-e2e.yaml:
- Around line 42-44: Keep the existing "Verify ODH Deployment" kubectl wait
precondition (the kubectl wait --for=condition=Available ... line) but add a
subsequent step that runs the repository's actual end-to-end test entrypoint so
the job exercises E2E flows rather than just a smoke check; update the workflow
to add a new run step after the Verify ODH Deployment block that invokes the
repo E2E entrypoint (replace with the correct command for your project, e.g.,
the Make target or script that triggers the full E2E suite).
- Around line 30-40: The workflow builds the operator image with make
image-build but never makes that local image available to the KinD cluster
before make deploy rewrites manifests to $(IMG); add a step after the Build
Operator Image step to either rebuild with buildx --load (adjust make
image-build to pass --load) or, more straightforwardly, run kind load
docker-image ${{ env.IMG }} (or kind load docker-image "$(IMG)" for the same
named cluster) so the locally-built image is present in the KinD nodes before
running make deploy; update the CI steps to call the chosen approach after
image-build and before make deploy so manifests referencing $(IMG) use the
just-built image.
- Around line 10-18: The job kind-odh-e2e-tests runs PR code with full
GITHUB_TOKEN permissions; add a job-level permissions block under
kind-odh-e2e-tests to restrict the token to least privilege (for example set
permissions: contents: read and any other minimal scopes your workflow actually
needs) so the job no longer inherits broad repository defaults; update the
.github/workflows/test-kind-odh-e2e.yaml job definition to include this
permissions stanza for the kind-odh-e2e-tests job.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: adfe5100-83e7-4137-b16d-89a982942082

📥 Commits

Reviewing files that changed from the base of the PR and between b6c91b4 and 7fe5b84.

📒 Files selected for processing (1)
  • .github/workflows/test-kind-odh-e2e.yaml

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch 8 times, most recently from c89e4e2 to 4e9a966 Compare March 11, 2026 14:03
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/e2e/controller_test.go (1)

429-432: Rename dependant to dependent before this flag ships.

These lines add a new user-facing CLI/env contract with a spelling mistake. If this lands as-is, fixing it later means carrying a permanent compatibility alias across flags, env vars, and code paths. Prefer test-dependent-operators-management / E2E_TEST_DEPENDENT_OPERATORS_MANAGEMENT and rename the matching symbols in this PR.

Also applies to: 496-497

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 429 - 432, The CLI flag and env
binding use the misspelled "dependant" in
pflag.Bool("test-dependant-operators-management", ...) and
viper.BindEnv(...+"_DEPENDANT_OPERATORS_MANAGEMENT"); change both to the correct
spelling "dependent" (i.e. pflag.Bool("test-dependent-operators-management",
...) and viper.BindEnv(...+"_DEPENDENT_OPERATORS_MANAGEMENT")) and update the
matching symbol names elsewhere in this PR (the other occurrences of the same
flag/env binding) so the flag, env var, and any code paths consistently use
"dependent" rather than "dependant".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-kind-odh-e2e.yaml:
- Around line 49-51: The KinD E2E step currently runs "make e2e-test" with no
explicit DSC management flag, so tests/e2e/controller_test.go defaults tag to
All and causes dscManagementTestSuite (Tier3-only) to run; update the Run E2E
Tests step where the command invokes "make e2e-test" and add the environment
variable -e E2E_TEST_DSC_MANAGEMENT=false (or set a narrower tag) so the
dscManagementTestSuite is excluded—look for the "Run E2E Tests" block and the
make e2e-test invocation to apply the change.
- Around line 28-31: The Create KinD Cluster step uses
helm/kind-action@ef37e7f... but doesn't pin the KinD node image; update that
step to include the node_image input and set it to a specific node image digest
(sha256) to fix supply-chain drift and ensure reproducible CI — locate the
"Create KinD Cluster" step and add the node_image field with the full
image@sha256:... value.

---

Nitpick comments:
In `@tests/e2e/controller_test.go`:
- Around line 429-432: The CLI flag and env binding use the misspelled
"dependant" in pflag.Bool("test-dependant-operators-management", ...) and
viper.BindEnv(...+"_DEPENDANT_OPERATORS_MANAGEMENT"); change both to the correct
spelling "dependent" (i.e. pflag.Bool("test-dependent-operators-management",
...) and viper.BindEnv(...+"_DEPENDENT_OPERATORS_MANAGEMENT")) and update the
matching symbol names elsewhere in this PR (the other occurrences of the same
flag/env binding) so the flag, env var, and any code paths consistently use
"dependent" rather than "dependant".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7ed27dc0-7342-4be6-a67a-22b1b078992b

📥 Commits

Reviewing files that changed from the base of the PR and between 8fa6b8b and 4e9a966.

📒 Files selected for processing (3)
  • .github/workflows/test-kind-odh-e2e.yaml
  • tests/e2e/controller_test.go
  • tests/e2e/creation_test.go

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch 5 times, most recently from 82ae74a to 14fd4b8 Compare March 11, 2026 14:39
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
tests/e2e/controller_test.go (1)

86-87: Minor: Consider consistent spelling "dependent" vs "dependant".

"Dependant" (British) vs "dependent" (American) spelling. Both are valid, but the codebase should be consistent. Most Go projects use American English conventions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/controller_test.go` around lines 86 - 87, Rename the variable
dependantOperatorsManagementTest to use American spelling
dependentOperatorsManagementTest across the test file and any places that
reference it (e.g., flag declarations, struct fields, helper functions, and test
checks) to keep naming consistent; update all usages of
dependantOperatorsManagementTest in tests/e2e/controller_test.go and related
test files so the identifier matches the new name and run `go test` to ensure no
references remain.
tests/e2e/creation_test.go (1)

34-58: Consider extracting common test suite setup to reduce duplication.

Both dependantOperatorsManagementTestSuite and dscManagementTestSuite share identical boilerplate:

  • skipUnless(t, Tier3)
  • NewTestContext(t) with error handling
  • DSCTestCtx creation

This pattern is repeated 4+ times across the file (also in dscValidationTestSuite, dscWebhookTestSuite).

♻️ Helper function to reduce duplication
func newDSCTestCtx(t *testing.T) *DSCTestCtx {
    t.Helper()
    tc, err := NewTestContext(t)
    require.NoError(t, err, "Failed to initialize test context")
    return &DSCTestCtx{TestContext: tc}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/creation_test.go` around lines 34 - 58, Multiple test suites
(dependantOperatorsManagementTestSuite, dscManagementTestSuite,
dscValidationTestSuite, dscWebhookTestSuite) duplicate the same setup
(skipUnless(t, Tier3); NewTestContext(t) error handling; DSCTestCtx creation);
extract that into a helper like newDSCTestCtx(t *testing.T) that calls
t.Helper(), performs skipUnless(t, Tier3) if needed (or keep skip in each
suite), calls NewTestContext(t) with require.NoError, and returns DSCTestCtx (or
*DSCTestCtx) so each suite replaces the duplicated block with a single call to
newDSCTestCtx to reduce repetition and centralize initialization.
pkg/utils/test/testf/testf_witht.go (1)

128-131: Removing StopErr from List() changes retry behavior.

Previously, any error from client.List() would wrap with StopErr, causing Eventually to fail immediately. Now all errors trigger retries until timeout.

For genuine failures (e.g., RBAC issues, API server unavailable), tests will now wait for the full Eventually timeout instead of failing fast. This could significantly increase CI feedback time on infrastructure failures.

Consider wrapping only transient/retryable errors and using StopErr for permanent failures like meta.IsNoMatchError or k8serr.IsForbidden.

♻️ Suggested approach
 err := t.Client().List(ctx, &items, option...)
 if err != nil {
-    return nil, err
+    // Stop immediately for permanent errors; retry for transient ones
+    if k8serr.IsForbidden(err) || k8serr.IsUnauthorized(err) {
+        return nil, StopErr(err, "permanent error listing resources: %s", gvk)
+    }
+    return nil, err // Retry on transient errors
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/utils/test/testf/testf_witht.go` around lines 128 - 131, The change
removed StopErr wrapping from the t.Client().List call which causes Eventually
to retry on all errors; instead detect permanent errors (e.g.
meta.IsNoMatchError, k8serr.IsForbidden, API server unreachable) and wrap only
those in StopErr so Eventually fails fast. Update the error handling around
t.Client().List(ctx, &items, option...) in testf_witht.go to classify the
returned err (use helpers or inline checks for meta.IsNoMatchError and
k8serr.IsForbidden and any other non-transient conditions your infra considers
permanent) and return StopErr(err) for those; return the raw err (so it can be
retried) for transient/network errors.
tests/e2e/test_context_test.go (2)

621-624: NoMatchError handling silently returns nil, losing error context.

Returning (nil, nil) for NoMatchError is semantically identical to "resource not found" (subU == nil). Callers like isOperatorHealthy (line 933) cannot distinguish between:

  1. OLM Subscription CRD missing (configuration/environment issue)
  2. Subscription resource simply doesn't exist yet

Consider returning a sentinel error or logging when NoMatchError occurs so environment misconfigurations are detectable during debugging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_context_test.go` around lines 621 - 624, The current
meta.IsNoMatchError(err) branch swallows the error by returning (nil, nil);
change it to preserve context so callers like isOperatorHealthy can distinguish
missing CRD vs absent resource — either return a sentinel error (e.g.,
ErrCRDMissing) or wrap and return the original error (e.g., fmt.Errorf("CRD
missing: %w", err)), and/or emit a log entry before returning; update
callers/tests to check for the sentinel/wrapped error instead of treating nil as
"not found".

1311-1314: Consistent with FetchActualSubscription, but same caveat applies.

Same concern as the Subscription variant above. Environment misconfiguration (missing OLM CRDs) will appear as "CSV not found" rather than a distinct error, potentially causing confusion in CI debugging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_context_test.go` around lines 1311 - 1314, The current error
handling swallows meta.IsNoMatchError by returning nil,nil which hides
environment misconfiguration (missing OLM CRDs); change the branch in the
CSV/CSV-fetching code that checks meta.IsNoMatchError(err) so it does not return
a silent nil but instead returns a distinct, descriptive error (e.g. wrap the
original err with a message like "OLM CRDs not installed or API resource
missing") so callers can distinguish "CSV not found" from a cluster CRD
discovery problem; update the branch that currently uses
meta.IsNoMatchError(err) to return a wrapped error (preserve the original err)
rather than nil,nil, referencing the meta.IsNoMatchError check in this file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/utils/test/testf/testf_witht.go`:
- Around line 128-131: The change removed StopErr wrapping from the
t.Client().List call which causes Eventually to retry on all errors; instead
detect permanent errors (e.g. meta.IsNoMatchError, k8serr.IsForbidden, API
server unreachable) and wrap only those in StopErr so Eventually fails fast.
Update the error handling around t.Client().List(ctx, &items, option...) in
testf_witht.go to classify the returned err (use helpers or inline checks for
meta.IsNoMatchError and k8serr.IsForbidden and any other non-transient
conditions your infra considers permanent) and return StopErr(err) for those;
return the raw err (so it can be retried) for transient/network errors.

In `@tests/e2e/controller_test.go`:
- Around line 86-87: Rename the variable dependantOperatorsManagementTest to use
American spelling dependentOperatorsManagementTest across the test file and any
places that reference it (e.g., flag declarations, struct fields, helper
functions, and test checks) to keep naming consistent; update all usages of
dependantOperatorsManagementTest in tests/e2e/controller_test.go and related
test files so the identifier matches the new name and run `go test` to ensure no
references remain.

In `@tests/e2e/creation_test.go`:
- Around line 34-58: Multiple test suites
(dependantOperatorsManagementTestSuite, dscManagementTestSuite,
dscValidationTestSuite, dscWebhookTestSuite) duplicate the same setup
(skipUnless(t, Tier3); NewTestContext(t) error handling; DSCTestCtx creation);
extract that into a helper like newDSCTestCtx(t *testing.T) that calls
t.Helper(), performs skipUnless(t, Tier3) if needed (or keep skip in each
suite), calls NewTestContext(t) with require.NoError, and returns DSCTestCtx (or
*DSCTestCtx) so each suite replaces the duplicated block with a single call to
newDSCTestCtx to reduce repetition and centralize initialization.

In `@tests/e2e/test_context_test.go`:
- Around line 621-624: The current meta.IsNoMatchError(err) branch swallows the
error by returning (nil, nil); change it to preserve context so callers like
isOperatorHealthy can distinguish missing CRD vs absent resource — either return
a sentinel error (e.g., ErrCRDMissing) or wrap and return the original error
(e.g., fmt.Errorf("CRD missing: %w", err)), and/or emit a log entry before
returning; update callers/tests to check for the sentinel/wrapped error instead
of treating nil as "not found".
- Around line 1311-1314: The current error handling swallows meta.IsNoMatchError
by returning nil,nil which hides environment misconfiguration (missing OLM
CRDs); change the branch in the CSV/CSV-fetching code that checks
meta.IsNoMatchError(err) so it does not return a silent nil but instead returns
a distinct, descriptive error (e.g. wrap the original err with a message like
"OLM CRDs not installed or API resource missing") so callers can distinguish
"CSV not found" from a cluster CRD discovery problem; update the branch that
currently uses meta.IsNoMatchError(err) to return a wrapped error (preserve the
original err) rather than nil,nil, referencing the meta.IsNoMatchError check in
this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7510ee22-ebbf-4ae4-a4bc-55c7cb840fc3

📥 Commits

Reviewing files that changed from the base of the PR and between 4e9a966 and 82ae74a.

📒 Files selected for processing (5)
  • .github/workflows/test-kind-odh-e2e.yaml
  • pkg/utils/test/testf/testf_witht.go
  • tests/e2e/controller_test.go
  • tests/e2e/creation_test.go
  • tests/e2e/test_context_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/test-kind-odh-e2e.yaml

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch 3 times, most recently from 291907e to dd90c18 Compare March 11, 2026 14:57
@zdtsw
Copy link
Member

zdtsw commented Mar 12, 2026

will this ever work? kind is xKS not ocp

@zdtsw
Copy link
Member

zdtsw commented Mar 12, 2026

also the node provisioned by github has enough resource to run operator e2e?

@CFSNM
Copy link
Member Author

CFSNM commented Mar 12, 2026

@zdtsw this is a new feature Platform team is working on to support ODH installation in non-OCP clusters. This is still a PR in development to support running the e2e tests in a kinD cluster

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch 9 times, most recently from 8c97ea2 to 0c7eb27 Compare March 12, 2026 14:00
@CFSNM CFSNM self-assigned this Mar 12, 2026
@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch from 0c7eb27 to 9730540 Compare March 12, 2026 14:18
@openshift-ci
Copy link

openshift-ci bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cfsnm. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch 2 times, most recently from 0a728ac to ffc64b2 Compare March 12, 2026 14:47
@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.76%. Comparing base (218b9d3) to head (ffc64b2).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
pkg/utils/test/testf/testf_witht.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3250   +/-   ##
=======================================
  Coverage   50.76%   50.76%           
=======================================
  Files         192      192           
  Lines       14027    14027           
=======================================
  Hits         7121     7121           
  Misses       6162     6162           
  Partials      744      744           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@valdar valdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that running the whole e2e test suit against a kind cluster won't work, a new way to deploy the operator in "xks mode" along with the cloud controller manager is needed ?

@valdar
Copy link
Member

valdar commented Mar 13, 2026

@CFSNM can you rebase on latest main?

@CFSNM
Copy link
Member Author

CFSNM commented Mar 13, 2026

@valdar the idea is just running the kserve tests in kinD, and we may need a new way to deploy it in KinD as well

@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch from 0b81de7 to 874f26f Compare March 13, 2026 13:58
@CFSNM CFSNM force-pushed the kind_e2e_gh_action branch from 874f26f to eeac1e4 Compare March 13, 2026 14:41
@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

@CFSNM: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/opendatahub-operator-rhoai-e2e eeac1e4 link true /test opendatahub-operator-rhoai-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

3 participants