
fix(llmisvc): defer migration until webhooks are serving#1251

Merged
openshift-merge-bot[bot] merged 1 commit into opendatahub-io:master from
bartoszmajsak:midstream/fix/webhook-storage-migration-timing
Mar 19, 2026

Conversation

@bartoszmajsak commented Mar 19, 2026

What this PR does / why we need it:

The controller was crashing on startup because storage version migration ran before the webhook server had registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so that it executes only after webhooks and caches are fully ready, with exponential backoff to handle Service endpoint propagation delays.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes RHOAIENG-54344

Summary by CodeRabbit

  • Bug Fixes

    • Storage version migration now runs only on the elected leader, sequentially per resource with exponential backoff and clearer handling of retryable vs non-retryable errors.
    • Migration failure reporting improved to distinguish timeouts from other failures.
    • Controller startup sequence adjusted so webhooks/conversion registration occur before reconciler setup.
  • Tests

    • Added E2E test validating storage version migration and cleanup of stale stored versions after upgrade.

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 85195f17-dc64-4545-8cda-02e96175394a

📥 Commits

Reviewing files that changed from the base of the PR and between 9461c97 and 22bae41.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

📝 Walkthrough


Replaces concurrent goroutines used for storage-version migration with a leader-election-scoped runnable type that implements Start(context.Context) error and NeedLeaderElection() bool. Migration is executed sequentially on the elected leader with per-resource exponential backoff (wait.ExponentialBackoffWithContext) and stops retrying on IsForbidden, IsUnauthorized, or IsNotFound. Controller setup order was adjusted to register the reconciler after webhook and conversion webhook registration. Adds an end-to-end test validating storage-version migration and CRD status.storedVersions restoration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Architecture & Logic Issues

  • Leader election window / migration atomicity: running long sequential migrations under leader election risks partial runs if leadership changes, causing duplicate or overlapping operations (race condition, CWE-362). Action: ensure migrations are idempotent, add leader-held guarantees (e.g., persistent migration markers or per-resource migration status) and guard re-entry on leadership change.
  • Silent RBAC skips: IsForbidden/IsUnauthorized/IsNotFound cause immediate stop for a resource, which can hide RBAC misconfigurations or missing resources (privilege/authorization issues, CWE-276). Action: record events/metrics and surface explicit failure states; consider failing startup when expected permissions are missing.
  • Startup delay from backoff policy: current backoff (2s × factor 1.5 × 10 steps with jitter) per resource can meaningfully delay controller readiness or cause availability degradation (DoS-like hang, CWE-400). Action: bound total retry budget, add global timeout, and document SLO impact.
  • Ambiguous error classification on timeout: returned errors combine timeout vs terminal errors, reducing observability. Action: return distinct, typed errors and emit per-resource events/logs with the last non-stopping error for diagnostics.

Test Coverage

  • Teardown robustness: test teardown patches CRD status.storedVersions and optionally deletes resources; ensure teardown tolerates missing CRDs, logs patch failures, and never masks teardown errors.
  • Rollout and crash detection: test restarts the controller and waits for rollout completion but may miss transient controller crashes during migration. Action: add post-rollout liveness/readiness checks and consider asserting controller pod logs or migration-related events to detect silent failures.

Dependency Change

  • golang.org/x/sync moved from direct to indirect in go.mod. This reduces explicit pinning of that dependency and may change transitive resolution, increasing supply-chain risk (dependency/version mismatch, CWE-494). Action: confirm resolved version via go mod graph/go mod tidy and re-pin if deterministic control is required.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly addresses the main change: deferring storage version migration until webhooks are ready, which is the core fix described in the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 298-312: The retry loop inside wait.ExponentialBackoffWithContext
treats any non-nil error as fatal by returning (false, err) for
permission-related errors; change the condition in the anonymous func used with
wait.ExponentialBackoffWithContext so that when migrator.Migrate(ctx, gr)
returns apierrors.IsForbidden(err), apierrors.IsUnauthorized(err) or
apierrors.IsNotFound(err) you log via setupLog (including gr and err) and return
(true, nil) to mark the resource as skipped instead of aborting startup; keep
the existing behavior for other errors (set lastErr and return false, nil to
retry) and preserve the outer error handling that inspects lastErr and
wait.Interrupted.
- Around line 284-289: The migrationBackoff (variable migrationBackoff of type
wait.Backoff) currently uses Duration: 2s, Factor: 1.5, Steps: 12 which yields
~5.7 minutes per resource; rebalance these fields to target ~2 minutes total
(for example, reduce Steps from 12 to 8 while keeping Duration and Factor, or
alternatively increase Duration modestly or lower Factor) so the geometric
backoff sum matches the ~120s window, and if you choose to keep the existing
values update the comment to state the actual timeout budget instead; change the
wait.Backoff fields (Duration, Factor, Steps) in the migrationBackoff
declaration accordingly and ensure the comment reflects the chosen timeout.

In `@test/e2e/llmisvc/test_storage_version_migration.py`:
- Around line 83-97: The current _cleanup_resources restores storedVersions on
both CRDs unconditionally before resource deletion, which can hide leftover
objects when SKIP_RESOURCE_DELETION is set or deletion fails; change the flow so
that you only call apix_client.patch_custom_resource_definition_status for
LLMISVC_CONFIG_CRD_NAME and LLMISVC_CRD_NAME after successful resource deletion
(i.e., skip restoration early if SKIP_RESOURCE_DELETION is true) and ensure
restoration runs in the success path or a dedicated cleanup-on-success block
rather than up-front so failures or skipped deletions do not rewrite
storedVersions prematurely.
- Around line 123-125: The test uses a fixed config_name
("migration-test-config") which can collide between runs; change the assignment
of config_name in test_storage_version_migration.py to generate a unique name
(e.g., append a short uuid4/timestamp/random hex) so each test creates an
isolated resource; update any usages of config_name (and related config dict) so
they reference the new generated value (look for the variable config_name and
the config object creation) and ensure teardown/cleanup still targets the
generated name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 93f39ce2-47ed-4aa4-a2a7-da1a4fb54532

📥 Commits

Reviewing files that changed from the base of the PR and between f5b52dd and 32e65e6.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py

Comment on lines +83 to +97
    def _cleanup_resources(self):
        """Clean up created resources and restore CRD status."""
        # Always restore CRD storedVersions to prevent dirty state
        for crd_name in [LLMISVC_CONFIG_CRD_NAME, LLMISVC_CRD_NAME]:
            try:
                self.apix_client.patch_custom_resource_definition_status(
                    crd_name,
                    body={"status": {"storedVersions": ["v1alpha2"]}},
                )
            except Exception as e:
                logger.warning(f"Failed to restore storedVersions for {crd_name}: {e}")

        if os.getenv("SKIP_RESOURCE_DELETION", "False").lower() in ("true", "1", "t"):
            logger.info("Skipping resource deletion after test execution.")
            return
coderabbitai bot commented Mar 19, 2026

⚠️ Potential issue | 🟡 Minor

Don’t restore storedVersions before cleanup succeeds.

This rewrites both CRDs back to ["v1alpha2"] even when SKIP_RESOURCE_DELETION is set or a later delete fails. On a failed/debug run, that can leave old objects behind while the CRD claims they no longer exist, which poisons later tests and destroys the state you wanted to inspect.


bartoszmajsak (Author):

It's a cluster-wide resource, so shouldn't we clean up unconditionally?

coderabbitai bot:

@bartoszmajsak, that's a fair point. Since CRD storedVersions is a cluster-scoped state that affects every test run and every namespace, unconditional restoration is the correct hygiene choice — leaving it dirty would poison subsequent runs regardless of whether namespace-scoped resource deletion was skipped. The two concerns are genuinely orthogonal: SKIP_RESOURCE_DELETION governs namespace-scoped CRs, while storedVersions patching governs cluster-wide CRD status.

The one edge case worth keeping in mind: if the test fails before the migration completes (objects still encoded as v1alpha1 in etcd), force-resetting storedVersions to ["v1alpha2"] creates a lie in the CRD status. In practice that's still preferable to leaving stale versions that break the next run, but it means the next controller start won't attempt the migration and those v1alpha1-encoded objects will rely on the conversion webhook alone. Not a blocker — just something to be aware of when debugging mid-test failures.

Happy to withdraw the original comment if you agree the current unconditional approach is acceptable.



Comment on lines +123 to +125
        # 1. Create a config resource via v1alpha1 so we have something to migrate
        config_name = "migration-test-config"
        config = {
coderabbitai bot commented Mar 19, 2026

⚠️ Potential issue | 🟡 Minor

Use a unique resource name here.

migration-test-config will collide with reruns, leaked resources, or parallel workers in the shared test namespace.

Suggested fix
-        config_name = "migration-test-config"
+        config_name = f"migration-test-config-{time.time_ns()}"

bartoszmajsak (Author):

A static name is simpler and matches the pattern used by other tests in this directory. It is also the only test that uses this resource, so is a unique name really worth it?

coderabbitai bot:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check naming patterns in other test files in the same directory
rg -n 'name\s*=\s*["\x27][a-z]' test/e2e/llmisvc/ --type py -A1 | head -60

Length of output: 2803


@bartoszmajsak, you're right — the rest of the directory consistently uses static descriptive names (e.g., "conv-test-v1a1-to-v1a2", "stop-feature-test", "scheduler-inline-config-test"), and the cleanup fixture already handles the resource. The suggestion doesn't apply here.

KillianGolds left a comment

Nice!

The leaderRunnable pattern is a clean way to utilise the manager lifecycle to avoid the timing issue.

I had 1 or 2 questions while reading through. Apologies if I'm asking anything that's been talked out elsewhere!

            client_configuration=client.Configuration(),
        )
        self.apix_client = client.ApiextensionsV1Api()
        self.apps_client = client.AppsV1Api()


Nit: self.apps_client is assigned but never used, the restart goes through subprocess/kubectl.

Safe to remove?

bartoszmajsak (Author):

Interesting that make precommit is happy :) Good catch.

KillianGolds:

make precommit never wants to play nice with me at the best of times due to being on macOS haha, so I tend to double-check the things it's normally meant to check

@bartoszmajsak bartoszmajsak force-pushed the midstream/fix/webhook-storage-migration-timing branch from 32e65e6 to 9461c97 Compare March 19, 2026 16:46
The controller was crashing on startup because storage version migration
ran before the webhook server had registered Service endpoints. The
Knative migrator patches each resource with an empty merge patch, which
triggers validating webhooks - but with no endpoints available, the API
server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so it executes only after
webhooks and caches are fully ready. Replaces the short DefaultBackoff
retry (~1.5s) with a generous exponential backoff (~2min window) to
handle Service endpoint propagation delays on platforms like OpenShift.

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)
coderabbitai bot left a comment

🧹 Nitpick comments (1)
test/e2e/llmisvc/test_storage_version_migration.py (1)

77-77: Remove unused apps_client assignment.

This was flagged in prior review and acknowledged—restart uses subprocess/kubectl, not the Kubernetes client.

Proposed fix
         self.apix_client = client.ApiextensionsV1Api()
-        self.apps_client = client.AppsV1Api()
         self.namespace = KSERVE_TEST_NAMESPACE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/llmisvc/test_storage_version_migration.py` at line 77, Remove the
unused Kubernetes client assignment by deleting the line that sets
self.apps_client = client.AppsV1Api() in the test (symbol: apps_client /
AppsV1Api); also remove any now-unused import of client or AppsV1Api from that
test file and ensure there are no remaining references to self.apps_client
elsewhere in test_storage_version_migration.py (run tests or search to confirm).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8a0c7cc5-f6e1-4abe-b2b2-bed8c2f9c85b

📥 Commits

Reviewing files that changed from the base of the PR and between 32e65e6 and 9461c97.

📒 Files selected for processing (3)
  • cmd/llmisvc/main.go
  • go.mod
  • test/e2e/llmisvc/test_storage_version_migration.py
✅ Files skipped from review due to trivial changes (1)
  • go.mod

@bartoszmajsak bartoszmajsak force-pushed the midstream/fix/webhook-storage-migration-timing branch from 9461c97 to 22bae41 Compare March 19, 2026 17:09
pierDipi (Member) left a comment

Thank you!

/lgtm
/approve


openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asanzgom, bartoszmajsak, pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [bartoszmajsak,pierDipi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bartoszmajsak (Author) commented:

/override ci/prow/e2e-llm-inference-service

Waiting for #1196 to land. Tests are passing on Kind


openshift-ci bot commented Mar 19, 2026

@bartoszmajsak: Overrode contexts on behalf of bartoszmajsak: ci/prow/e2e-llm-inference-service

Details

In response to this:

/override ci/prow/e2e-llm-inference-service

Waiting for #1196 to land. Tests are passing on Kind

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 81121b0 into opendatahub-io:master Mar 19, 2026
28 checks passed
@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Mar 19, 2026
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
The controller was crashing on startup because storage version migration
ran before the webhook server had registered Service endpoints. The
Knative migrator patches each resource with an empty merge patch, which
triggers validating webhooks - but with no endpoints available, the API
server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so it executes only after
webhooks and caches are fully ready. Replaces the short DefaultBackoff
retry (~1.5s) with a generous exponential backoff (~2min window) to
handle Service endpoint propagation delays on platforms like OpenShift.

Port of opendatahub-io#1251
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Fixes a startup crash in the `LLMInferenceService` controller caused by
storage version migration running before the webhook server has registered
Service endpoints. The Knative migrator patches each resource with an empty
merge patch, which triggers validating webhooks - but with no endpoints
available, the API server rejects every patch and the controller exits.

Moves migration into a `LeaderElectionRunnable` so it executes only after
webhooks and caches are fully ready, uses exponential backoff to handle
Service endpoint propagation delays.

Port of opendatahub-io#1251
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Fixes a startup crash in the `LLMInferenceService` controller caused by
storage version migration running before the webhook server has registered
Service endpoints. The Knative migrator patches each resource with an empty
merge patch, which triggers validating webhooks - but with no endpoints
available, the API server rejects every patch and the controller exits.

Moves migration into a `LeaderElectionRunnable` so it executes only after
webhooks and caches are fully ready, uses exponential backoff to handle
Service endpoint propagation delays.

Port of opendatahub-io#1251

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/kserve that referenced this pull request Mar 19, 2026
Cherry-pick of opendatahub-io#1251

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
pierDipi pushed a commit to red-hat-data-services/kserve that referenced this pull request Mar 19, 2026
…ing (#4276)

Cherry-pick of opendatahub-io#1251

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>

Projects

Status: Done


4 participants