Controller is not launching more pods even though there's a lot of jobs queued #1492
Unanswered
bmbferreira asked this question in Questions
Replies: 2 comments 1 reply
-
Controller logs would help.
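For reference, with a default Helm install the controller logs can usually be pulled like this; the namespace and deployment name depend on how ARC was installed, so treat them as assumptions:

```sh
# Tail the manager container of the ARC controller; adjust the
# namespace/deployment to match your install (these are the Helm defaults).
kubectl logs -n actions-runner-system deployment/actions-runner-controller \
  -c manager --since=1h
```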
-
Hi @toast-gear! I'll try to get them the next time I see this behaviour happening. Meanwhile, could this issue be related to this? I think my configuration for the …
-
Hi! I'm having a hard time understanding why the controller is not launching more pods even though I have an enormous queue of jobs waiting to be executed.
I configured autoscaling based on the recommended workflow_job webhook, with a minimum of 1 replica and a maximum of 10. The configuration is this:
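For context, a HorizontalRunnerAutoscaler driven by the workflow_job webhook generally looks something like the sketch below; the resource names here are hypothetical, and `duration` controls how long each queued-job event keeps capacity reserved:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    # The RunnerDeployment being scaled (hypothetical name).
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    # Each workflow_job "queued" event reserves one extra replica until it
    # is matched by a "completed" event or until `duration` elapses.
    - githubEvent:
        workflowJob: {}
      duration: "30m"
```

One detail that may be relevant here: if jobs routinely run longer than `duration`, their capacity reservations expire while the jobs are still running, which can hold the desired replica count below the real backlog.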
What is happening is that we have some jobs that get stuck on an external tool, and their pods sometimes stay running for about 2 hours before completing. Even setting aside these long-running jobs, which are admittedly a problem in themselves, I never see the number of pods reach the maximum number of replicas that I configured (20). I get 3-4 pods running for a couple of hours and an enormous queue of waiting jobs. If I manually delete these long-running jobs, the queue starts to recover.
What am I missing? I expected the maximum number of replicas to be used even while other jobs run for a long time.
Also, the infrastructure is not the problem: the cluster autoscaler is working fine and launches new nodes when new pods are created. The problem seems to be the controller, because I don't see new pods starting for the queued jobs.
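As a diagnostic sketch, the replica count the controller is actually targeting, and the outstanding webhook capacity reservations, are visible on the HorizontalRunnerAutoscaler object itself (resource name hypothetical):

```sh
# The DESIRED column shows the replica count the controller is targeting.
kubectl get horizontalrunnerautoscaler example-runner-deployment-autoscaler

# Each workflow_job webhook event adds an entry under spec.capacityReservations
# with an expiration time; once an entry expires it no longer counts toward
# desired replicas, even if the job it belongs to is still queued.
kubectl get horizontalrunnerautoscaler example-runner-deployment-autoscaler -o yaml
```

If the DESIRED value sits well below the queue length while jobs are stuck, that would point at expired reservations rather than infrastructure.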
Thanks in advance for your help!