fix(llmisvc): defer migration until webhooks are serving #1251
📝 Walkthrough

Replaces the concurrent goroutines used for storage-version migration with a leader-election-scoped runnable type that implements LeaderElectionRunnable.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Review areas: Architecture & Logic Issues, Test Coverage, Dependency Change

🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 298-312: The retry loop inside wait.ExponentialBackoffWithContext
treats permission-related errors as fatal by returning (false, err); change the
condition in the anonymous func used with
wait.ExponentialBackoffWithContext so that when migrator.Migrate(ctx, gr)
returns apierrors.IsForbidden(err), apierrors.IsUnauthorized(err) or
apierrors.IsNotFound(err) you log via setupLog (including gr and err) and return
(true, nil) to mark the resource as skipped instead of aborting startup; keep
the existing behavior for other errors (set lastErr and return false, nil to
retry) and preserve the outer error handling that inspects lastErr and
wait.Interrupted.
- Around line 284-289: The migrationBackoff (variable migrationBackoff of type
wait.Backoff) currently uses Duration: 2s, Factor: 1.5, Steps: 12 which yields
~5.7 minutes per resource; rebalance these fields to target ~2 minutes total
(for example, reduce Steps from 12 to 8 while keeping Duration and Factor, or
alternatively increase Duration modestly or lower Factor) so the geometric
backoff sum matches the ~120s window, and if you choose to keep the existing
values update the comment to state the actual timeout budget instead; change the
wait.Backoff fields (Duration, Factor, Steps) in the migrationBackoff
declaration accordingly and ensure the comment reflects the chosen timeout.
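The geometric sum behind that timeout budget is easy to check. The sketch below is stdlib-only arithmetic (the `backoffBudget` helper is illustrative, not part of k8s.io/apimachinery), assuming `wait.ExponentialBackoffWithContext` makes `Steps` attempts with `Steps-1` sleeps of `Duration * Factor^i` between them:

```go
// Quick check of the timeout budget implied by a wait.Backoff-style
// geometric series: Steps attempts imply Steps-1 sleeps, each longer
// than the last by Factor.
package main

import "fmt"

// backoffBudget returns the total sleep time in seconds for the given
// initial duration (seconds), multiplier, and attempt count.
func backoffBudget(duration, factor float64, steps int) float64 {
	total := 0.0
	d := duration
	for i := 0; i < steps-1; i++ {
		total += d
		d *= factor
	}
	return total
}

func main() {
	// Current values under discussion: 2s, 1.5, 12 steps.
	fmt.Printf("steps=12: %.0fs\n", backoffBudget(2, 1.5, 12))
	// Candidate rebalances: fewer steps shrink the tail quickly.
	fmt.Printf("steps=10: %.0fs\n", backoffBudget(2, 1.5, 10))
	fmt.Printf("steps=8:  %.0fs\n", backoffBudget(2, 1.5, 8))
}
```

Under these assumptions, Duration=2s / Factor=1.5 / Steps=12 sums to roughly 342s (~5.7 minutes), matching the figure above; 10 steps lands near 150s and 8 steps near 64s, so hitting a ~2 minute window takes either fewer steps or a tweaked Duration/Factor.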
In `@test/e2e/llmisvc/test_storage_version_migration.py`:
- Around line 83-97: The current _cleanup_resources restores storedVersions on
both CRDs unconditionally before resource deletion, which can hide leftover
objects when SKIP_RESOURCE_DELETION is set or deletion fails; change the flow so
that you only call apix_client.patch_custom_resource_definition_status for
LLMISVC_CONFIG_CRD_NAME and LLMISVC_CRD_NAME after successful resource deletion
(i.e., skip restoration early if SKIP_RESOURCE_DELETION is true) and ensure
restoration runs in the success path or a dedicated cleanup-on-success block
rather than up-front so failures or skipped deletions do not rewrite
storedVersions prematurely.
- Around line 123-125: The test uses a fixed config_name
("migration-test-config") which can collide between runs; change the assignment
of config_name in test_storage_version_migration.py to generate a unique name
(e.g., append a short uuid4/timestamp/random hex) so each test creates an
isolated resource; update any usages of config_name (and related config dict) so
they reference the new generated value (look for the variable config_name and
the config object creation) and ensure teardown/cleanup still targets the
generated name.
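The (done, err) semantics suggested for the main.go retry loop can be modeled without any Kubernetes dependencies. In this sketch the sentinel errors and `retryMigrate` are stand-ins for `apierrors.IsForbidden`/`IsNotFound` and `wait.ExponentialBackoffWithContext`'s contract (done=true stops, `(false, nil)` retries, a non-nil error aborts); it is not the actual KServe code.

```go
// Stdlib-only model of the condition-func contract the suggestion relies on.
package main

import (
	"errors"
	"fmt"
)

var (
	errForbidden = errors.New("forbidden")    // stands in for apierrors.IsForbidden
	errNotFound  = errors.New("not found")    // stands in for apierrors.IsNotFound
	errTransient = errors.New("no endpoints") // e.g. webhook Service not ready yet
)

// isPermanentSkip mirrors the suggested check: permission/absence errors
// mean the resource should be logged and skipped, not treated as fatal.
func isPermanentSkip(err error) bool {
	return errors.Is(err, errForbidden) || errors.Is(err, errNotFound)
}

// retryMigrate mimics the backoff helper's contract without the sleeps.
func retryMigrate(steps int, migrate func() error) error {
	var lastErr error
	for i := 0; i < steps; i++ {
		err := migrate()
		if err == nil {
			return nil // migrated
		}
		if isPermanentSkip(err) {
			fmt.Printf("skipping migration: %v\n", err)
			return nil // done=true, nil: skip instead of aborting startup
		}
		lastErr = err // transient: retry
	}
	return fmt.Errorf("retries exhausted: %w", lastErr)
}

func main() {
	// A forbidden error is skipped rather than failing startup.
	fmt.Println(retryMigrate(3, func() error { return errForbidden }))
	// A transient error keeps retrying until steps are exhausted.
	fmt.Println(retryMigrate(3, func() error { return errTransient }))
}
```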
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 93f39ce2-47ed-4aa4-a2a7-da1a4fb54532
📒 Files selected for processing (3)
cmd/llmisvc/main.go, go.mod, test/e2e/llmisvc/test_storage_version_migration.py
```python
def _cleanup_resources(self):
    """Clean up created resources and restore CRD status."""
    # Always restore CRD storedVersions to prevent dirty state
    for crd_name in [LLMISVC_CONFIG_CRD_NAME, LLMISVC_CRD_NAME]:
        try:
            self.apix_client.patch_custom_resource_definition_status(
                crd_name,
                body={"status": {"storedVersions": ["v1alpha2"]}},
            )
        except Exception as e:
            logger.warning(f"Failed to restore storedVersions for {crd_name}: {e}")

    if os.getenv("SKIP_RESOURCE_DELETION", "False").lower() in ("true", "1", "t"):
        logger.info("Skipping resource deletion after test execution.")
        return
```
Don’t restore storedVersions before cleanup succeeds.
This rewrites both CRDs back to ["v1alpha2"] even when SKIP_RESOURCE_DELETION is set or a later delete fails. On a failed/debug run, that can leave old objects behind while the CRD claims they no longer exist, which poisons later tests and destroys the state you wanted to inspect.
It's cluster-wide resource, so shouldn't we unconditionally clean-up?
@bartoszmajsak, that's a fair point. Since CRD storedVersions is a cluster-scoped state that affects every test run and every namespace, unconditional restoration is the correct hygiene choice — leaving it dirty would poison subsequent runs regardless of whether namespace-scoped resource deletion was skipped. The two concerns are genuinely orthogonal: SKIP_RESOURCE_DELETION governs namespace-scoped CRs, while storedVersions patching governs cluster-wide CRD status.
The one edge case worth keeping in mind: if the test fails before the migration completes (objects still encoded as v1alpha1 in etcd), force-resetting storedVersions to ["v1alpha2"] creates a lie in the CRD status. In practice that's still preferable to leaving stale versions that break the next run, but it means the next controller start won't attempt the migration and those v1alpha1-encoded objects will rely on the conversion webhook alone. Not a blocker — just something to be aware of when debugging mid-test failures.
Happy to withdraw the original comment if you agree the current unconditional approach is acceptable.
```python
# 1. Create a config resource via v1alpha1 so we have something to migrate
config_name = "migration-test-config"
config = {
```
Use a unique resource name here.
migration-test-config will collide with reruns, leaked resources, or parallel workers in the shared test namespace.
Suggested fix

```diff
- config_name = "migration-test-config"
+ config_name = f"migration-test-config-{time.time_ns()}"
```
A static name is simpler and matches the pattern used by other tests in this directory - it is also the only test using this resource - is this really worth a unique name?
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check naming patterns in other test files in the same directory
rg -n 'name\s*=\s*["\x27][a-z]' test/e2e/llmisvc/ --type py -A1 | head -60
```

Length of output: 2803
@bartoszmajsak, you're right — the rest of the directory consistently uses static descriptive names (e.g., "conv-test-v1a1-to-v1a2", "stop-feature-test", "scheduler-inline-config-test"), and the cleanup fixture already handles the resource. The suggestion doesn't apply here.
KillianGolds
left a comment
Nice!
The leaderRunnable pattern is a clean way to utilise the manager lifecycle to avoid the timing issue.
I had 1 or 2 questions while reading through. Apologies if I'm asking anything that's been talked out elsewhere!
```python
    client_configuration=client.Configuration(),
)
self.apix_client = client.ApiextensionsV1Api()
self.apps_client = client.AppsV1Api()
```
Nit: self.apps_client is assigned but never used; the restart goes through subprocess/kubectl.
Safe to remove?
Interesting that make precommit is happy :) Good catch.
make precommit never wants to play nice with me at the best of times due to being on macOS haha, so I tend to double check things it's normally meant to check
32e65e6 to 9461c97 (force-pushed)
The controller was crashing on startup because storage version migration ran before the webhook server had registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits.

Moves migration into a LeaderElectionRunnable so it executes only after webhooks and caches are fully ready. Replaces the short DefaultBackoff retry (~1.5s) with a generous exponential backoff (~2min window) to handle Service endpoint propagation delays on platforms like OpenShift.

Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344)
🧹 Nitpick comments (1)

test/e2e/llmisvc/test_storage_version_migration.py (1)

77-77: Remove unused `apps_client` assignment. This was flagged in prior review and acknowledged - restart uses subprocess/kubectl, not the Kubernetes client.

Proposed fix

```diff
  self.apix_client = client.ApiextensionsV1Api()
- self.apps_client = client.AppsV1Api()
  self.namespace = KSERVE_TEST_NAMESPACE
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/llmisvc/test_storage_version_migration.py` at line 77, Remove the unused Kubernetes client assignment by deleting the line that sets self.apps_client = client.AppsV1Api() in the test (symbol: apps_client / AppsV1Api); also remove any now-unused import of client or AppsV1Api from that test file and ensure there are no remaining references to self.apps_client elsewhere in test_storage_version_migration.py (run tests or search to confirm).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 8a0c7cc5-f6e1-4abe-b2b2-bed8c2f9c85b
📒 Files selected for processing (3)
cmd/llmisvc/main.go, go.mod, test/e2e/llmisvc/test_storage_version_migration.py
✅ Files skipped from review due to trivial changes (1)
- go.mod
9461c97 to 22bae41 (force-pushed)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asanzgom, bartoszmajsak, pierDipi

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/override ci/prow/e2e-llm-inference-service

Waiting for #1196 to land. Tests are passing on Kind.
@bartoszmajsak: Overrode contexts on behalf of bartoszmajsak: ci/prow/e2e-llm-inference-service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Merged 81121b0 into opendatahub-io:master
The controller was crashing on startup because storage version migration ran before the webhook server had registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits. Moves migration into a LeaderElectionRunnable so it executes only after webhooks and caches are fully ready. Replaces the short DefaultBackoff retry (~1.5s) with a generous exponential backoff (~2min window) to handle Service endpoint propagation delays on platforms like OpenShift. Port of opendatahub-io#1251
Fixes a startup crash in the `LLMInferenceService` controller caused by storage version migration running before the webhook server has registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits. Moves migration into a `LeaderElectionRunnable` so it executes only after webhooks and caches are fully ready, uses exponential backoff to handle Service endpoint propagation delays. Port of opendatahub-io#1251
Fixes a startup crash in the `LLMInferenceService` controller caused by storage version migration running before the webhook server has registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits. Moves migration into a `LeaderElectionRunnable` so it executes only after webhooks and caches are fully ready, uses exponential backoff to handle Service endpoint propagation delays. Port of opendatahub-io#1251 Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Cherry-pick of opendatahub-io#1251 Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344) Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
…ing (#4276) Cherry-pick of opendatahub-io#1251 Fixes [RHOAIENG-54344](https://redhat.atlassian.net/browse/RHOAIENG-54344) Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
What this PR does / why we need it:
The controller was crashing on startup because storage version migration ran before the webhook server had registered Service endpoints. The Knative migrator patches each resource with an empty merge patch, which triggers validating webhooks - but with no endpoints available, the API server rejects every patch and the controller exits.
Moves migration into a LeaderElectionRunnable so it executes only after webhooks and caches are fully ready, with exponential backoff to handle Service endpoint propagation delays.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes RHOAIENG-54344