[BUG] - Operator Pod Error Loops if the Pod has to recreate #22

@odellem

Describe the bug
After a payload is applied to the Slurm cluster, the operator creates the daemonset for the slurmabler pods. However, if the operator pod crashes or restarts, it enters an error loop because the daemonset already exists.

To Reproduce
Steps to reproduce the behavior:

  1. Install Slik
  2. Apply either payload
  3. Let the slurmabler pods be created
  4. Delete the operator pod, allowing the deployment to recreate it, and check the logs for the error loop.

Expected behavior
The operator should handle an existing daemonset gracefully when it starts. If the daemonset genuinely needs to be created again, the operator should delete it and then recreate it.
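
A minimal sketch of what "gracefully" could look like, assuming the loop comes from an unconditional create call that fails when the daemonset is already there. Nothing here is Slik's actual code or a real Kubernetes client: `FakeCluster`, `AlreadyExists`, `ensure_daemonset`, and the `slurmabler:example` image are hypothetical names used only to illustrate an idempotent "create, or reconcile if it exists" step.

```python
class AlreadyExists(Exception):
    """Hypothetical stand-in for an 'already exists' response from the API server."""


class FakeCluster:
    """In-memory stand-in for the cluster API, for illustration only."""

    def __init__(self):
        self.daemonsets = {}

    def create(self, name, spec):
        # Mirrors the failure mode in this issue: creating twice is an error.
        if name in self.daemonsets:
            raise AlreadyExists(name)
        self.daemonsets[name] = dict(spec)

    def get(self, name):
        return self.daemonsets[name]

    def replace(self, name, spec):
        self.daemonsets[name] = dict(spec)


def ensure_daemonset(cluster, name, spec):
    """Create the daemonset, or reconcile it if it already exists.

    Returns "created", "unchanged", or "updated" instead of raising,
    so a restarted operator can run this step repeatedly.
    """
    try:
        cluster.create(name, spec)
        return "created"
    except AlreadyExists:
        # The daemonset survived the operator restart; only touch it if it drifted.
        if cluster.get(name) == spec:
            return "unchanged"
        cluster.replace(name, spec)
        return "updated"


if __name__ == "__main__":
    cluster = FakeCluster()
    spec = {"image": "slurmabler:example"}  # hypothetical image name
    print(ensure_daemonset(cluster, "slurmabler", spec))  # first operator start
    print(ensure_daemonset(cluster, "slurmabler", spec))  # after an operator restart
```

Treating "already exists" as a normal outcome avoids the delete-and-recreate step in the common case, so the slurmabler pods would not be restarted just because the operator pod was.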

Additional context
Deleting the daemonset and restarting the operator pod fixes the problem. However, pods are moved around during the rolling update when you upgrade a cluster, so any cluster upgrade will break the slurm operator.

Labels: bug