@syedazeez337
The eraser-controller-manager creates multiple ImageJobs each time it restarts. Each ImageJob spawns pods on every node, which overwhelms the API server and leads to further OOM events, producing an OOM cascade.

This fix adds synchronization to the 'first-reconcile' mechanism:

  • Use a mutex to prevent concurrent first-reconcile executions
  • Track completion state to avoid redundant job creation
  • Detect existing running jobs before cleanup
  • Re-list jobs after cleanup to get accurate state
  • Create a new job only if no running jobs exist
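
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `JobStore`, `Job`, `FirstReconciler`, and `simulateRestarts` are hypothetical stand-ins (an in-memory store replaces the API server), and restarts are modeled as sequential new reconciler instances, each receiving several concurrent first-reconcile calls.

```go
package main

import (
	"fmt"
	"sync"
)

// JobPhase is a simplified, hypothetical stand-in for an ImageJob phase.
type JobPhase string

const (
	PhaseRunning   JobPhase = "Running"
	PhaseCompleted JobPhase = "Completed"
)

// Job is a simplified stand-in for an ImageJob object.
type Job struct {
	Name  string
	Phase JobPhase
}

// JobStore is an in-memory stand-in for the API server (assumption).
type JobStore struct {
	mu      sync.Mutex
	jobs    []Job
	created int
}

// List returns a copy of the current jobs.
func (s *JobStore) List() []Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]Job(nil), s.jobs...)
}

// DeleteFinished removes every job that is not running.
func (s *JobStore) DeleteFinished() {
	s.mu.Lock()
	defer s.mu.Unlock()
	kept := s.jobs[:0]
	for _, j := range s.jobs {
		if j.Phase == PhaseRunning {
			kept = append(kept, j)
		}
	}
	s.jobs = kept
}

// Create adds a new running job and counts it.
func (s *JobStore) Create() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.created++
	s.jobs = append(s.jobs, Job{Name: fmt.Sprintf("imagejob-%d", s.created), Phase: PhaseRunning})
}

// Created reports how many jobs were created in total.
func (s *JobStore) Created() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.created
}

func hasRunning(jobs []Job) bool {
	for _, j := range jobs {
		if j.Phase == PhaseRunning {
			return true
		}
	}
	return false
}

// FirstReconciler guards the first-reconcile step with a mutex and a done flag.
type FirstReconciler struct {
	mu    sync.Mutex
	done  bool
	store *JobStore
}

// FirstReconcile reports whether this call created a job.
func (r *FirstReconciler) FirstReconcile() bool {
	r.mu.Lock() // only one first-reconcile may run at a time
	defer r.mu.Unlock()
	if r.done { // completion state: later calls do nothing
		return false
	}
	r.done = true
	if hasRunning(r.store.List()) { // detect running jobs before cleanup
		return false
	}
	r.store.DeleteFinished()
	if hasRunning(r.store.List()) { // re-list after cleanup
		return false
	}
	r.store.Create() // no running job exists, so create exactly one
	return true
}

// simulateRestarts models each restart as a fresh reconciler hit by concurrent callers.
func simulateRestarts(store *JobStore, restarts, callers int) {
	for i := 0; i < restarts; i++ {
		r := &FirstReconciler{store: store}
		var wg sync.WaitGroup
		for c := 0; c < callers; c++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				r.FirstReconcile()
			}()
		}
		wg.Wait()
	}
}
```

Calling `simulateRestarts(&JobStore{}, 10, 4)` from a `main` function and then reading `Created()` on the store shows how many jobs the sketch creates across ten simulated restarts. The in-process mutex only covers concurrent calls within one controller instance; across restarts, the running-job check is what prevents a second job.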

Testing confirms the fix:

  • 10 concurrent controller restarts now create only 1 ImageJob (vs 10 before)
  • No redundant pod creation
  • No API server pressure
  • No OOM cascade

Resolves: #1169

Signed-off-by: Azeez Syed <[email protected]>