fix(imagecollector): prevent OOM cascade on controller restart #1206

syedazeez337 · 2025-12-29T12:52:30Z

When the eraser-controller-manager restarts, it creates multiple ImageJobs. Each ImageJob spawns pods on every node, which overwhelms the API server and triggers further OOM events, producing an OOM cascade.

This fix adds synchronization to the 'first-reconcile' mechanism:

- Use a mutex to prevent concurrent first-reconcile executions
- Track completion state to avoid redundant job creation
- Properly detect existing running jobs before cleanup
- Re-list jobs after cleanup for accurate state
- Only create a new job if no running jobs exist

Testing confirms the fix:

- 10 concurrent controller restarts now create only 1 ImageJob (vs 10 before)
- No redundant pod creation
- No API server pressure
- No OOM cascade
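To put the first result in terms of pod count, here is a small arithmetic sketch. It assumes each ImageJob schedules exactly one pod per node, and the cluster size of 50 nodes is hypothetical; neither figure comes from the test run above.

```go
package main

import "fmt"

// podsSpawned returns the number of pods scheduled when every ImageJob
// runs one pod on each node (an assumption made for this illustration).
func podsSpawned(imageJobs, nodes int) int {
	return imageJobs * nodes
}

func main() {
	const nodes = 50 // hypothetical cluster size

	fmt.Println("before fix:", podsSpawned(10, nodes)) // 10 ImageJobs: 500 pods
	fmt.Println("after fix:", podsSpawned(1, nodes))   // 1 ImageJob: 50 pods
}
```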

Resolves: #1169

The eraser-controller-manager creates multiple ImageJobs when it restarts, causing an OOM cascade where each ImageJob spawns pods on every node, overwhelming the API server and leading to further OOM events.

This fix adds synchronization to the 'first-reconcile' mechanism:
- Use mutex to prevent concurrent first-reconcile executions
- Track completion state to avoid redundant job creation
- Properly detect existing running jobs before cleanup
- Re-list jobs after cleanup for accurate state
- Only create new job if no running jobs exist

Testing confirms the fix:
- 10 concurrent controller restarts now create only 1 ImageJob (vs 10 before)
- No redundant pod creation
- No API server pressure
- No OOM cascade

Resolves: eraser-dev#1169

Signed-off-by: Azeez Syed <[email protected]>

syedazeez337 requested review from ashnamehrotra, pmengelbert and sozercan as code owners December 29, 2025 12:52

syedazeez337 force-pushed the fix/oom-cascade-issue-1169 branch from adb254d to c3fef1f Compare December 29, 2025 12:53
