Open
Labels
bug: Something isn't working
Description
Describe the bug
After a payload is applied to the Slurm cluster, the operator creates the daemonset for the slurmabler pods. However, if the operator pod crashes or restarts, it goes into an error loop because the daemonset already exists.
To Reproduce
Steps to reproduce the behavior:
- Install Slik
- Apply either payload
- Let the slurmabler pods be created
- Delete the operator pod, allowing the deployment to recreate it, and check the logs for the error loop.
Expected behavior
The operator should handle the error gracefully. If the daemonset really does need to be created again, the operator should delete the existing one and then recreate it.
Additional context
Deleting the daemonset and restarting the operator pod fixes the problem. However, pods are moved around during the rolling update when a cluster is upgraded, so any cluster upgrade will break the Slurm operator.