Controller is not launching more pods even though there's a lot of jobs queued #1492
Unanswered
bmbferreira asked this question in Questions
Replies: 2 comments 1 reply
-
Controller logs would help.
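For reference, with a default Helm install the controller logs can usually be pulled like this; the namespace and deployment name depend on how ARC was installed, so treat them as assumptions:

```sh
# Tail the manager container of the ARC controller; adjust the
# namespace/deployment to match your install (these are the Helm defaults).
kubectl logs -n actions-runner-system deployment/actions-runner-controller \
  -c manager --since=1h
```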
-
Hi @toast-gear! I'll try to get them the next time I see this behaviour happening. Meanwhile, could this issue be related to this? I think my configuration for the …
-
Hi! I'm having a hard time understanding why the controller is not launching more pods even though I have an enormous queue of jobs waiting to be executed.
I configured autoscaling based on the recommended workflow_job webhook, with a minimum of 1 replica and a maximum of 10. The configuration is this:
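For context, a HorizontalRunnerAutoscaler driven by the workflow_job webhook generally looks something like the sketch below; the resource names here are hypothetical, and `duration` controls how long each queued-job event keeps capacity reserved:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    # The RunnerDeployment being scaled (hypothetical name).
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    # Each workflow_job "queued" event reserves one extra replica until it
    # is matched by a "completed" event or until `duration` elapses.
    - githubEvent:
        workflowJob: {}
      duration: "30m"
```

One detail that may be relevant here: if jobs routinely run longer than `duration`, their capacity reservations expire while the jobs are still running, which can hold the desired replica count below the real backlog.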
What is happening is that we have some jobs that get stuck on an external tool, and their pods sometimes stay running for about 2 hours before completing. Even setting aside these long-running jobs, which are admittedly a problem in themselves, I never see the number of pods reach the maximum number of replicas that I configured (20). I get 3-4 pods running for a couple of hours and an enormous queue of waiting jobs. If I manually delete these long-running jobs, the queue starts to recover.
What am I missing? I expected the maximum number of replicas to be used even while other jobs run for a long time.
Also, the infrastructure is not the problem: the cluster autoscaler is working fine and launches new nodes when new pods are created. The problem seems to be the controller, because I don't see new pods starting for the queued jobs.
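As a diagnostic sketch, the replica count the controller is actually targeting, and the outstanding webhook capacity reservations, are visible on the HorizontalRunnerAutoscaler object itself (resource name hypothetical):

```sh
# The DESIRED column shows the replica count the controller is targeting.
kubectl get horizontalrunnerautoscaler example-runner-deployment-autoscaler

# Each workflow_job webhook event adds an entry under spec.capacityReservations
# with an expiration time; once an entry expires it no longer counts toward
# desired replicas, even if the job it belongs to is still queued.
kubectl get horizontalrunnerautoscaler example-runner-deployment-autoscaler -o yaml
```

If the DESIRED value sits well below the queue length while jobs are stuck, that would point at expired reservations rather than infrastructure.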
Thanks in advance for your help!