
fix(llmisvc): defer migration until webhooks are serving#1251

Merged
openshift-merge-bot[bot] merged 1 commit into opendatahub-io:master from
bartoszmajsak:midstream/fix/webhook-storage-migration-timing
Mar 19, 2026

Conversation

@bartoszmajsak commented Mar 19, 2026

What this PR does / why we need it:

The controller was crashing on startup because storage version migration ran before the webhook server had registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so that it executes only after webhooks and caches are fully ready, with exponential backoff to handle Service endpoint propagation delays.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes RHOAIENG-54344

Summary by CodeRabbit

  • Bug Fixes

    • Storage version migration now runs only on the elected leader, sequentially per resource with exponential backoff and clearer handling of retryable vs non-retryable errors.
    • Migration failure reporting improved to distinguish timeouts from other failures.
    • Controller startup sequence adjusted so webhooks/conversion registration occur before reconciler setup.
  • Tests

    • Added E2E test validating storage version migration and cleanup of stale stored versions after upgrade.

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 85195f17-dc64-4545-8cda-02e96175394a

📥 Commits

Reviewing files that changed from the base of the PR and between 9461c97 and 22bae41.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

📝 Walkthrough


Replaces concurrent goroutines used for storage-version migration with a leader-election-scoped runnable type that implements Start(context.Context) error and NeedLeaderElection() bool. Migration is executed sequentially on the elected leader with per-resource exponential backoff (wait.ExponentialBackoffWithContext) and stops retrying on IsForbidden, IsUnauthorized, or IsNotFound. Controller setup order was adjusted to register the reconciler after webhook and conversion webhook registration. Adds an end-to-end test validating storage-version migration and CRD status.storedVersions restoration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Architecture & Logic Issues

  • Leader election window / migration atomicity: running long sequential migrations under leader election risks partial runs if leadership changes, causing duplicate or overlapping operations (race condition, CWE-362). Action: ensure migrations are idempotent, add leader-held guarantees (e.g., persistent migration markers or per-resource migration status) and guard re-entry on leadership change.
  • Silent RBAC skips: IsForbidden/IsUnauthorized/IsNotFound cause immediate stop for a resource, which can hide RBAC misconfigurations or missing resources (privilege/authorization issues, CWE-276). Action: record events/metrics and surface explicit failure states; consider failing startup when expected permissions are missing.
  • Startup delay from backoff policy: current backoff (2s × factor 1.5 × 10 steps with jitter) per resource can meaningfully delay controller readiness or cause availability degradation (DoS-like hang, CWE-400). Action: bound total retry budget, add global timeout, and document SLO impact.
  • Ambiguous error classification on timeout: returned errors combine timeout vs terminal errors, reducing observability. Action: return distinct, typed errors and emit per-resource events/logs with the last non-stopping error for diagnostics.

Test Coverage

  • Teardown robustness: test teardown patches CRD status.storedVersions and optionally deletes resources; ensure teardown tolerates missing CRDs, logs patch failures, and never masks teardown errors.
  • Rollout and crash detection: test restarts the controller and waits for rollout completion but may miss transient controller crashes during migration. Action: add post-rollout liveness/readiness checks and consider asserting controller pod logs or migration-related events to detect silent failures.

Dependency Change

  • golang.org/x/sync moved from direct to indirect in go.mod. This reduces explicit pinning of that dependency and may change transitive resolution, increasing supply-chain risk (dependency/version mismatch, CWE-494). Action: confirm resolved version via go mod graph/go mod tidy and re-pin if deterministic control is required.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly addresses the main change: deferring storage version migration until webhooks are ready, which is the core fix described in the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 298-312: The retry loop inside wait.ExponentialBackoffWithContext
treats any non-nil error as fatal by returning (false, err) for
permission-related errors; change the condition in the anonymous func used with
wait.ExponentialBackoffWithContext so that when migrator.Migrate(ctx, gr)
returns apierrors.IsForbidden(err), apierrors.IsUnauthorized(err) or
apierrors.IsNotFound(err) you log via setupLog (including gr and err) and return
(true, nil) to mark the resource as skipped instead of aborting startup; keep
the existing behavior for other errors (set lastErr and return false, nil to
retry) and preserve the outer error handling that inspects lastErr and
wait.Interrupted.
- Around line 284-289: The migrationBackoff (variable migrationBackoff of type
wait.Backoff) currently uses Duration: 2s, Factor: 1.5, Steps: 12 which yields
~5.7 minutes per resource; rebalance these fields to target ~2 minutes total
(for example, reduce Steps from 12 to 8 while keeping Duration and Factor, or
alternatively increase Duration modestly or lower Factor) so the geometric
backoff sum matches the ~120s window, and if you choose to keep the existing
values update the comment to state the actual timeout budget instead; change the
wait.Backoff fields (Duration, Factor, Steps) in the migrationBackoff
declaration accordingly and ensure the comment reflects the chosen timeout.

In `@test/e2e/llmisvc/test_storage_version_migration.py`:
- Around line 83-97: The current _cleanup_resources restores storedVersions on
both CRDs unconditionally before resource deletion, which can hide leftover
objects when SKIP_RESOURCE_DELETION is set or deletion fails; change the flow so
that you only call apix_client.patch_custom_resource_definition_status for
LLMISVC_CONFIG_CRD_NAME and LLMISVC_CRD_NAME after successful resource deletion
(i.e., skip restoration early if SKIP_RESOURCE_DELETION is true) and ensure
restoration runs in the success path or a dedicated cleanup-on-success block
rather than up-front so failures or skipped deletions do not rewrite
storedVersions prematurely.
- Around line 123-125: The test uses a fixed config_name
("migration-test-config") which can collide between runs; change the assignment
of config_name in test_storage_version_migration.py to generate a unique name
(e.g., append a short uuid4/timestamp/random hex) so each test creates an
isolated resource; update any usages of config_name (and related config dict) so
they reference the new generated value (look for the variable config_name and
the config object creation) and ensure teardown/cleanup still targets the
generated name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 93f39ce2-47ed-4aa4-a2a7-da1a4fb54532

📥 Commits

Reviewing files that changed from the base of the PR and between f5b52dd and 32e65e6.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py

Comment on lines +83 to +97
    def _cleanup_resources(self):
        """Clean up created resources and restore CRD status."""
        # Always restore CRD storedVersions to prevent dirty state
        for crd_name in [LLMISVC_CONFIG_CRD_NAME, LLMISVC_CRD_NAME]:
            try:
                self.apix_client.patch_custom_resource_definition_status(
                    crd_name,
                    body={"status": {"storedVersions": ["v1alpha2"]}},
                )
            except Exception as e:
                logger.warning(f"Failed to restore storedVersions for {crd_name}: {e}")

        if os.getenv("SKIP_RESOURCE_DELETION", "False").lower() in ("true", "1", "t"):
            logger.info("Skipping resource deletion after test execution.")
            return
coderabbitai bot commented Mar 19, 2026

⚠️ Potential issue | 🟡 Minor

Don’t restore storedVersions before cleanup succeeds.

This rewrites both CRDs back to ["v1alpha2"] even when SKIP_RESOURCE_DELETION is set or a later delete fails. On a failed/debug run, that can leave old objects behind while the CRD claims they no longer exist, which poisons later tests and destroys the state you wanted to inspect.


bartoszmajsak (Author):

It's a cluster-wide resource, so shouldn't we clean up unconditionally?

coderabbitai bot:

@bartoszmajsak, that's a fair point. Since CRD storedVersions is a cluster-scoped state that affects every test run and every namespace, unconditional restoration is the correct hygiene choice — leaving it dirty would poison subsequent runs regardless of whether namespace-scoped resource deletion was skipped. The two concerns are genuinely orthogonal: SKIP_RESOURCE_DELETION governs namespace-scoped CRs, while storedVersions patching governs cluster-wide CRD status.

The one edge case worth keeping in mind: if the test fails before the migration completes (objects still encoded as v1alpha1 in etcd), force-resetting storedVersions to ["v1alpha2"] creates a lie in the CRD status. In practice that's still preferable to leaving stale versions that break the next run, but it means the next controller start won't attempt the migration and those v1alpha1-encoded objects will rely on the conversion webhook alone. Not a blocker — just something to be aware of when debugging mid-test failures.

Happy to withdraw the original comment if you agree the current unconditional approach is acceptable.



Comment on lines +123 to +125
        # 1. Create a config resource via v1alpha1 so we have something to migrate
        config_name = "migration-test-config"
        config = {
coderabbitai bot commented Mar 19, 2026

⚠️ Potential issue | 🟡 Minor

Use a unique resource name here.

migration-test-config will collide with reruns, leaked resources, or parallel workers in the shared test namespace.

Suggested fix
-        config_name = "migration-test-config"
+        config_name = f"migration-test-config-{time.time_ns()}"

bartoszmajsak (Author):

A static name is simpler and matches the pattern used by other tests in this directory. It is also the only test that uses this resource, so is a unique name really worth it?

coderabbitai bot:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check naming patterns in other test files in the same directory
rg -n 'name\s*=\s*["\x27][a-z]' test/e2e/llmisvc/ --type py -A1 | head -60

Length of output: 2803


@bartoszmajsak, you're right — the rest of the directory consistently uses static descriptive names (e.g., "conv-test-v1a1-to-v1a2", "stop-feature-test", "scheduler-inline-config-test"), and the cleanup fixture already handles the resource. The suggestion doesn't apply here.

KillianGolds left a comment

Nice!

The leaderRunnable pattern is a clean way to utilise the manager lifecycle to avoid the timing issue.

I had 1 or 2 questions while reading through. Apologies if I'm asking anything that's been talked out elsewhere!

            client_configuration=client.Configuration(),
        )
        self.apix_client = client.ApiextensionsV1Api()
        self.apps_client = client.AppsV1Api()


Nit: self.apps_client is assigned but never used, the restart goes through subprocess/kubectl.

Safe to remove?

bartoszmajsak (Author):

Interesting that make precommit is happy :) Good catch.

KillianGolds:

make precommit never wants to play nice with me at the best of times due to being on macOS haha, so I tend to double-check the things it's normally meant to check

@bartoszmajsak bartoszmajsak force-pushed the midstream/fix/webhook-storage-migration-timing branch from 32e65e6 to 9461c97 Compare March 19, 2026 16:46
The controller was crashing on startup because storage version migration
ran before the webhook server had registered Service endpoints. The
Knative migrator patches each resource with an empty merge patch, which
triggers validating webhooks - but with no endpoints available, the API
server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so it executes only after
webhooks and caches are fully ready. Replaces the short DefaultBackoff
retry (~1.5s) with a generous exponential backoff (~2min window) to
handle Service endpoint propagation delays on platforms like OpenShift.

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)
coderabbitai bot left a comment

🧹 Nitpick comments (1)
test/e2e/llmisvc/test_storage_version_migration.py (1)

77-77: Remove unused apps_client assignment.

This was flagged in prior review and acknowledged—restart uses subprocess/kubectl, not the Kubernetes client.

Proposed fix
         self.apix_client = client.ApiextensionsV1Api()
-        self.apps_client = client.AppsV1Api()
         self.namespace = KSERVE_TEST_NAMESPACE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/llmisvc/test_storage_version_migration.py` at line 77, Remove the
unused Kubernetes client assignment by deleting the line that sets
self.apps_client = client.AppsV1Api() in the test (symbol: apps_client /
AppsV1Api); also remove any now-unused import of client or AppsV1Api from that
test file and ensure there are no remaining references to self.apps_client
elsewhere in test_storage_version_migration.py (run tests or search to confirm).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8a0c7cc5-f6e1-4abe-b2b2-bed8c2f9c85b

📥 Commits

Reviewing files that changed from the base of the PR and between 32e65e6 and 9461c97.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py
✅ Files skipped from review due to trivial changes (1)
  • go.mod

@bartoszmajsak bartoszmajsak force-pushed the midstream/fix/webhook-storage-migration-timing branch from 9461c97 to 22bae41 Compare March 19, 2026 17:09
pierDipi (Member) left a comment

Thank you!

/lgtm
/approve


openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asanzgom, bartoszmajsak, pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [bartoszmajsak,pierDipi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bartoszmajsak (Author) commented:

/override ci/prow/e2e-llm-inference-service

Waiting for #1196 to land. Tests are passing on Kind


openshift-ci bot commented Mar 19, 2026

@bartoszmajsak: Overrode contexts on behalf of bartoszmajsak: ci/prow/e2e-llm-inference-service

Details

In response to this:

/override ci/prow/e2e-llm-inference-service

Waiting for #1196 to land. Tests are passing on Kind

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 81121b0 into opendatahub-io:master Mar 19, 2026
28 checks passed
@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Mar 19, 2026
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
The controller was crashing on startup because storage version migration
ran before the webhook server had registered Service endpoints. The
Knative migrator patches each resource with an empty merge patch, which
triggers validating webhooks - but with no endpoints available, the API
server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so it executes only after
webhooks and caches are fully ready. Replaces the short DefaultBackoff
retry (~1.5s) with a generous exponential backoff (~2min window) to
handle Service endpoint propagation delays on platforms like OpenShift.

Port of opendatahub-io#1251
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Fixes a startup crash in the `LLMInferenceService` controller caused by
storage version migration running before the webhook server has registered
Service endpoints. The Knative migrator patches each resource with an empty
merge patch, which triggers validating webhooks - but with no endpoints
available, the API server rejects every patch and the controller exits.

Moves migration into a `LeaderElectionRunnable` so it executes only after
webhooks and caches are fully ready, uses exponential backoff to handle
Service endpoint propagation delays.

Port of opendatahub-io#1251
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Fixes a startup crash in the `LLMInferenceService` controller caused by
storage version migration running before the webhook server has registered
Service endpoints. The Knative migrator patches each resource with an empty
merge patch, which triggers validating webhooks - but with no endpoints
available, the API server rejects every patch and the controller exits.

Moves migration into a `LeaderElectionRunnable` so it executes only after
webhooks and caches are fully ready, uses exponential backoff to handle
Service endpoint propagation delays.

Port of opendatahub-io#1251

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Cherry-pick of opendatahub-io#1251

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
pierDipi pushed a commit to red-hat-data-services/kserve that referenced this pull request Mar 19, 2026
…ing (#4276)

Cherry-pick of opendatahub-io#1251

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>

Projects

Status: Done


4 participants