fix(llmisvc): prevent migration deadlock during controller startup #1250

asanzgom wants to merge 1 commit into opendatahub-io:master
Conversation
Moved resource migration to run after the webhook server is ready, preventing the circular dependency that caused CrashLoopBackOff during startup with existing resources. The fix includes:

- Webhook readiness wait (5s) to ensure service endpoints are available
- Retry logic with exponential backoff (3 attempts: 1s, 2s, 4s)
- Non-fatal error handling to allow the controller to serve new requests

Fixes RHOAIENG-54344
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: asanzgom. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing a comment.
📝 Walkthrough

Migration execution shifted from synchronous startup to an asynchronous post-start runnable.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Security & Quality Issues:

- Fixed delay for webhook readiness creates a race condition — a 5-second sleep is a timing guess, not proof the webhook is reachable.
- Silent failure on migration errors — returning success from the runnable hides failed migrations while the manager stays healthy.
- Concurrent resource migration safety unchecked — launching goroutines per GroupResource assumes mutations are safe under concurrent execution. If resources share state or backends, concurrent writes during migration could corrupt data or violate invariants. Actionable: document concurrent mutation guarantees or add per-resource locking if needed.

🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/llmisvc/main.go`:
- Around line 282-287: The fixed time.Sleep should be replaced with an actual
readiness check: remove the time.Sleep(5 * time.Second) after the
setupLog.Info("waiting for webhook server readiness") call and instead poll the
cluster to confirm the webhook Service has ready endpoints (via Endpoints or
EndpointSlice using the in-cluster k8s client) or attempt a TCP connect to the
webhook Service DNS:port, retrying until success or ctx is done; ensure the
polling respects ctx cancellation/timeout and returns an error if unsuccessful
so the subsequent migration/startup code (the manager startup and migration
calls) only proceed once the webhook is reachable.
- Around line 299-306: The retry loop calling migrator.Migrate(ctx, gr)
currently always sleeps after each attempt using time.Sleep, including after the
final attempt, and the sleep ignores ctx cancellation; change the logic so you
only wait between attempts (i.e., skip sleeping after the last attempt) and
replace time.Sleep with a cancellable wait that returns early on ctx.Done() (for
example using time.NewTimer/backoff timer and select on timer.C and ctx.Done());
ensure you preserve the exponential backoff calculation (1s, 2s, 4s) and still
set lastErr and log via setupLog when a retry is scheduled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0d1073af-bbb9-45ab-84cf-843eb114b804
📒 Files selected for processing (1)
cmd/llmisvc/main.go
The diff under review in cmd/llmisvc/main.go:

```go
// Wait for webhook server to be fully ready and service endpoints to be updated
// This prevents a race condition where migration starts before the webhook
// service has endpoints available in Kubernetes
setupLog.Info("waiting for webhook server readiness")
time.Sleep(5 * time.Second)
```
Replace the fixed sleep with a real webhook readiness check.
Line 286 is still a timing guess, not proof that the apiserver can reach the conversion webhook through its Service. On slower clusters, the Service can take longer than 5 seconds to expose a ready endpoint; when that happens, the retries can all run before conversion is actually reachable, and Lines 315-319 then hide the failed migration while the manager stays healthy. Wait on a real readiness signal—e.g. the webhook Service via in-cluster DNS or its Endpoints/EndpointSlice, with a timeout tied to ctx—instead of time.Sleep(5 * time.Second).
The retry loop under review in cmd/llmisvc/main.go:

```go
for attempt := 0; attempt < 3; attempt++ {
    if err := migrator.Migrate(ctx, gr); err != nil {
        lastErr = err
        setupLog.Info("migration attempt failed, will retry",
            "resource", gr, "attempt", attempt+1, "error", err.Error())
        // Exponential backoff: 1s, 2s, 4s
        backoff := time.Second * time.Duration(1<<uint(attempt))
        time.Sleep(backoff)
```
Make the retry backoff cancellable and skip the last sleep.
Lines 305-306 always wait, including after the third and final attempt, so permanent failures incur an unnecessary extra 4-second stall. time.Sleep also ignores ctx.Done(), which delays shutdown or leader handoff if the manager is stopping during backoff.
Suggested fix:

```diff
-    for attempt := 0; attempt < 3; attempt++ {
+    const maxAttempts = 3
+    for attempt := 0; attempt < maxAttempts; attempt++ {
         if err := migrator.Migrate(ctx, gr); err != nil {
             lastErr = err
             setupLog.Info("migration attempt failed, will retry",
                 "resource", gr, "attempt", attempt+1, "error", err.Error())
-            // Exponential backoff: 1s, 2s, 4s
-            backoff := time.Second * time.Duration(1<<uint(attempt))
-            time.Sleep(backoff)
+            if attempt+1 < maxAttempts {
+                backoff := time.Second * time.Duration(1<<uint(attempt))
+                select {
+                case <-time.After(backoff):
+                case <-ctx.Done():
+                    return nil
+                }
+            }
             continue
         }
```
Duplicates #1251