fix(llmisvc): prevent migration deadlock during controller startup #1250

Closed
asanzgom wants to merge 1 commit into opendatahub-io:master from asanzgom:bugfix/RHOAIENG-54344-migration-deadlock

Conversation

@asanzgom commented Mar 19, 2026

Moved resource migration to run after webhook server is ready, preventing circular dependency that caused CrashLoopBackOff during startup with existing resources.

The fix includes:

  • Webhook readiness wait (5s) to ensure service endpoints are available
  • Retry logic with exponential backoff (3 attempts: 1s, 2s, 4s)
  • Non-fatal error handling to allow controller to serve new requests

Fixes RHOAIENG-54344

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Feature/Issue validation/testing:

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?
  • Have you linked the JIRA issue(s) to this PR?

Release note:


Summary by CodeRabbit

  • Improvements
    • Migrations now execute asynchronously with automatic retry logic and exponential backoff for each attempt.
    • Service continues operating even when migrations encounter failures; errors are logged for review.
    • Added initialization delay before migration execution to allow system components to stabilize.

openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: asanzgom
Once this PR has been reviewed and has the lgtm label, please assign brettmthompson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Migration execution shifted from synchronous startup to asynchronous post-start runnable via mgr.Add(). The implementation defers migrations until after manager startup with a fixed 5-second delay, concurrently processes each GroupResource via goroutines under errgroup, retries failed migrations up to 3 times with exponential backoff (1s, 2s, 4s), and logs migration errors without terminating the process.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security & Quality Issues

Fixed delay for webhook readiness creates a race condition — time.Sleep(5 * time.Second) is not a synchronization mechanism. The webhook server may not be ready after 5 seconds, or may be ready after 2 seconds, so the timing is unpredictable (cf. CWE-367, Time-of-check Time-of-use Race Condition). Actionable: implement proper health checks or lifecycle hooks instead of fixed delays.

Silent failure on migration errors — Returning nil from the runnable despite migration failures masks critical state. If migrations are prerequisite to correct operation, the service becomes available in an inconsistent state without operators knowing. CWE-391 (Unchecked Error Condition). Actionable: Define which migration errors are fatal vs. recoverable; consider blocking service readiness until critical migrations succeed.

Concurrent resource migration safety unchecked — Launching goroutines per GroupResource assumes mutations are safe under concurrent execution. If resources share state or backends, concurrent writes during migration could corrupt data or violate invariants. Actionable: Document concurrent mutation guarantees or add per-resource locking if needed.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately describes the main change — moving resource migration to after webhook startup to prevent deadlock during controller initialization.


coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 282-287: The fixed time.Sleep should be replaced with an actual
readiness check: remove the time.Sleep(5 * time.Second) after the
setupLog.Info("waiting for webhook server readiness") call and instead poll the
cluster to confirm the webhook Service has ready endpoints (via Endpoints or
EndpointSlice using the in-cluster k8s client) or attempt a TCP connect to the
webhook Service DNS:port, retrying until success or ctx is done; ensure the
polling respects ctx cancellation/timeout and returns an error if unsuccessful
so the subsequent migration/startup code (the manager startup and migration
calls) only proceed once the webhook is reachable.
- Around line 299-306: The retry loop calling migrator.Migrate(ctx, gr)
currently always sleeps after each attempt using time.Sleep, including after the
final attempt, and the sleep ignores ctx cancellation; change the logic so you
only wait between attempts (i.e., skip sleeping after the last attempt) and
replace time.Sleep with a cancellable wait that returns early on ctx.Done() (for
example using time.NewTimer/backoff timer and select on timer.C and ctx.Done());
ensure you preserve the exponential backoff calculation (1s, 2s, 4s) and still
set lastErr and log via setupLog when a retry is scheduled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0d1073af-bbb9-45ab-84cf-843eb114b804

📥 Commits

Reviewing files that changed from the base of the PR and between f5b52dd and 5d810dd.

📒 Files selected for processing (1)
  • cmd/llmisvc/main.go

Comment on lines +282 to +287
// Wait for webhook server to be fully ready and service endpoints to be updated
// This prevents a race condition where migration starts before the webhook
// service has endpoints available in Kubernetes
setupLog.Info("waiting for webhook server readiness")
time.Sleep(5 * time.Second)


⚠️ Potential issue | 🟠 Major

Replace the fixed sleep with a real webhook readiness check.

Line 286 is still a timing guess, not proof that the apiserver can reach the conversion webhook through its Service. On slower clusters, the Service can take longer than 5 seconds to expose a ready endpoint; when that happens, the retries can all run before conversion is actually reachable, and Lines 315-319 then hide the failed migration while the manager stays healthy. Wait on a real readiness signal—e.g. the webhook Service via in-cluster DNS or its Endpoints/EndpointSlice, with a timeout tied to ctx—instead of time.Sleep(5 * time.Second).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/llmisvc/main.go` around lines 282 - 287, The fixed time.Sleep should be
replaced with an actual readiness check: remove the time.Sleep(5 * time.Second)
after the setupLog.Info("waiting for webhook server readiness") call and instead
poll the cluster to confirm the webhook Service has ready endpoints (via
Endpoints or EndpointSlice using the in-cluster k8s client) or attempt a TCP
connect to the webhook Service DNS:port, retrying until success or ctx is done;
ensure the polling respects ctx cancellation/timeout and returns an error if
unsuccessful so the subsequent migration/startup code (the manager startup and
migration calls) only proceed once the webhook is reachable.

Comment on lines +299 to +306
for attempt := 0; attempt < 3; attempt++ {
if err := migrator.Migrate(ctx, gr); err != nil {
lastErr = err
setupLog.Info("migration attempt failed, will retry",
"resource", gr, "attempt", attempt+1, "error", err.Error())
// Exponential backoff: 1s, 2s, 4s
backoff := time.Second * time.Duration(1<<uint(attempt))
time.Sleep(backoff)

⚠️ Potential issue | 🟡 Minor

Make the retry backoff cancellable and skip the last sleep.

Lines 305-306 always wait, including after the third and final attempt, so permanent failures incur an unnecessary extra 4-second stall. time.Sleep also ignores ctx.Done(), which delays shutdown or leader handoff if the manager is stopping during backoff.

Suggested fix
-				for attempt := 0; attempt < 3; attempt++ {
+				const maxAttempts = 3
+				for attempt := 0; attempt < maxAttempts; attempt++ {
 					if err := migrator.Migrate(ctx, gr); err != nil {
 						lastErr = err
 						setupLog.Info("migration attempt failed, will retry",
 							"resource", gr, "attempt", attempt+1, "error", err.Error())
-						// Exponential backoff: 1s, 2s, 4s
-						backoff := time.Second * time.Duration(1<<uint(attempt))
-						time.Sleep(backoff)
+						if attempt+1 < maxAttempts {
+							backoff := time.Second * time.Duration(1<<uint(attempt))
+							select {
+							case <-time.After(backoff):
+							case <-ctx.Done():
+								return nil
+							}
+						}
 						continue
 					}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/llmisvc/main.go` around lines 299 - 306, The retry loop calling
migrator.Migrate(ctx, gr) currently always sleeps after each attempt using
time.Sleep, including after the final attempt, and the sleep ignores ctx
cancellation; change the logic so you only wait between attempts (i.e., skip
sleeping after the last attempt) and replace time.Sleep with a cancellable wait
that returns early on ctx.Done() (for example using time.NewTimer/backoff timer
and select on timer.C and ctx.Done()); ensure you preserve the exponential
backoff calculation (1s, 2s, 4s) and still set lastErr and log via setupLog when
a retry is scheduled.

@bartoszmajsak

Duplicates #1251

@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Mar 19, 2026

Labels: none yet
Projects: ODH Model Serving Planning — Status: Done
2 participants