fix(llmisvc): prevent migration deadlock during controller startup #1250

Closed
asanzgom wants to merge 1 commit into opendatahub-io:master from asanzgom:bugfix/RHOAIENG-54344-migration-deadlock

Conversation

@asanzgom commented Mar 19, 2026

Moved resource migration to run after webhook server is ready, preventing circular dependency that caused CrashLoopBackOff during startup with existing resources.

The fix includes:

  • Webhook readiness wait (5s) to ensure service endpoints are available
  • Retry logic with exponential backoff (3 attempts: 1s, 2s, 4s)
  • Non-fatal error handling to allow controller to serve new requests

Fixes RHOAIENG-54344

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Feature/Issue validation/testing:

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?
  • Have you linked the JIRA issue(s) to this PR?

Release note:


Summary by CodeRabbit

  • Improvements
    • Migrations now execute asynchronously with automatic retry logic and exponential backoff for each attempt.
    • Service continues operating even when migrations encounter failures; errors are logged for review.
    • Added initialization delay before migration execution to allow system components to stabilize.

openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: asanzgom
Once this PR has been reviewed and has the lgtm label, please assign brettmthompson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Migration execution shifted from synchronous startup to asynchronous post-start runnable via mgr.Add(). The implementation defers migrations until after manager startup with a fixed 5-second delay, concurrently processes each GroupResource via goroutines under errgroup, retries failed migrations up to 3 times with exponential backoff (1s, 2s, 4s), and logs migration errors without terminating the process.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security & Quality Issues

Fixed delay for webhook readiness creates a race condition — time.Sleep(5 * time.Second) is not a synchronization mechanism. The webhook server may not be ready after 5 seconds, or may be ready after 2 seconds, so the timing is unpredictable (cf. CWE-367, Time-of-check Time-of-use Race Condition). Actionable: implement proper health checks or lifecycle hooks instead of fixed delays.

Silent failure on migration errors — Returning nil from the runnable despite migration failures masks critical state. If migrations are prerequisite to correct operation, the service becomes available in an inconsistent state without operators knowing. CWE-391 (Unchecked Error Condition). Actionable: Define which migration errors are fatal vs. recoverable; consider blocking service readiness until critical migrations succeed.

Concurrent resource migration safety unchecked — Launching goroutines per GroupResource assumes mutations are safe under concurrent execution. If resources share state or backends, concurrent writes during migration could corrupt data or violate invariants. Actionable: Document concurrent mutation guarantees or add per-resource locking if needed.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately describes the main change — moving resource migration to after webhook startup to prevent deadlock during controller initialization.


coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 282-287: The fixed time.Sleep should be replaced with an actual
readiness check: remove the time.Sleep(5 * time.Second) after the
setupLog.Info("waiting for webhook server readiness") call and instead poll the
cluster to confirm the webhook Service has ready endpoints (via Endpoints or
EndpointSlice using the in-cluster k8s client) or attempt a TCP connect to the
webhook Service DNS:port, retrying until success or ctx is done; ensure the
polling respects ctx cancellation/timeout and returns an error if unsuccessful
so the subsequent migration/startup code (the manager startup and migration
calls) only proceed once the webhook is reachable.
- Around line 299-306: The retry loop calling migrator.Migrate(ctx, gr)
currently always sleeps after each attempt using time.Sleep, including after the
final attempt, and the sleep ignores ctx cancellation; change the logic so you
only wait between attempts (i.e., skip sleeping after the last attempt) and
replace time.Sleep with a cancellable wait that returns early on ctx.Done() (for
example using time.NewTimer/backoff timer and select on timer.C and ctx.Done());
ensure you preserve the exponential backoff calculation (1s, 2s, 4s) and still
set lastErr and log via setupLog when a retry is scheduled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0d1073af-bbb9-45ab-84cf-843eb114b804

📥 Commits

Reviewing files that changed from the base of the PR and between f5b52dd and 5d810dd.

📒 Files selected for processing (1)
  • cmd/llmisvc/main.go

Comment on lines +282 to +287
// Wait for webhook server to be fully ready and service endpoints to be updated
// This prevents a race condition where migration starts before the webhook
// service has endpoints available in Kubernetes
setupLog.Info("waiting for webhook server readiness")
time.Sleep(5 * time.Second)


⚠️ Potential issue | 🟠 Major

Replace the fixed sleep with a real webhook readiness check.

Line 286 is still a timing guess, not proof that the apiserver can reach the conversion webhook through its Service. On slower clusters, the Service can take longer than 5 seconds to expose a ready endpoint; when that happens, the retries can all run before conversion is actually reachable, and Lines 315-319 then hide the failed migration while the manager stays healthy. Wait on a real readiness signal—e.g. the webhook Service via in-cluster DNS or its Endpoints/EndpointSlice, with a timeout tied to ctx—instead of time.Sleep(5 * time.Second).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/llmisvc/main.go` around lines 282 - 287, The fixed time.Sleep should be
replaced with an actual readiness check: remove the time.Sleep(5 * time.Second)
after the setupLog.Info("waiting for webhook server readiness") call and instead
poll the cluster to confirm the webhook Service has ready endpoints (via
Endpoints or EndpointSlice using the in-cluster k8s client) or attempt a TCP
connect to the webhook Service DNS:port, retrying until success or ctx is done;
ensure the polling respects ctx cancellation/timeout and returns an error if
unsuccessful so the subsequent migration/startup code (the manager startup and
migration calls) only proceed once the webhook is reachable.

Comment on lines +299 to +306
for attempt := 0; attempt < 3; attempt++ {
if err := migrator.Migrate(ctx, gr); err != nil {
lastErr = err
setupLog.Info("migration attempt failed, will retry",
"resource", gr, "attempt", attempt+1, "error", err.Error())
// Exponential backoff: 1s, 2s, 4s
backoff := time.Second * time.Duration(1<<uint(attempt))
time.Sleep(backoff)

⚠️ Potential issue | 🟡 Minor

Make the retry backoff cancellable and skip the last sleep.

Lines 305-306 always wait, including after the third and final attempt, so permanent failures incur an unnecessary extra 4-second stall. time.Sleep also ignores ctx.Done(), which delays shutdown or leader handoff if the manager is stopping during backoff.

Suggested fix
-				for attempt := 0; attempt < 3; attempt++ {
+				const maxAttempts = 3
+				for attempt := 0; attempt < maxAttempts; attempt++ {
 					if err := migrator.Migrate(ctx, gr); err != nil {
 						lastErr = err
 						setupLog.Info("migration attempt failed, will retry",
 							"resource", gr, "attempt", attempt+1, "error", err.Error())
-						// Exponential backoff: 1s, 2s, 4s
-						backoff := time.Second * time.Duration(1<<uint(attempt))
-						time.Sleep(backoff)
+						if attempt+1 < maxAttempts {
+							backoff := time.Second * time.Duration(1<<uint(attempt))
+							select {
+							case <-time.After(backoff):
+							case <-ctx.Done():
+								return nil
+							}
+						}
 						continue
 					}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/llmisvc/main.go` around lines 299 - 306, The retry loop calling
migrator.Migrate(ctx, gr) currently always sleeps after each attempt using
time.Sleep, including after the final attempt, and the sleep ignores ctx
cancellation; change the logic so you only wait between attempts (i.e., skip
sleeping after the last attempt) and replace time.Sleep with a cancellable wait
that returns early on ctx.Done() (for example using time.NewTimer/backoff timer
and select on timer.C and ctx.Done()); ensure you preserve the exponential
backoff calculation (1s, 2s, 4s) and still set lastErr and log via setupLog when
a retry is scheduled.

@bartoszmajsak

Duplicates #1251

@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Mar 19, 2026

Labels: none yet
Projects: ODH Model Serving Planning — Status: Done
2 participants