Replies: 1 comment 8 replies
You can always adjust CONCOURSE_GC_FAILED_GRACE_PERIOD; the default is 5 days. If memory serves, the idea was to let you debug nasty side effects even after a long weekend. If your usual jobs are not resource-heavy, you could raise the max_containers count; if they are, scale horizontally by adding more workers rather than vertically with bigger machines. It really depends on your setup and use case.
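A back-of-the-envelope sketch of that trade-off, in case it helps. It assumes each failed build keeps at least one container alive for the whole grace period, and the per-worker limit of 250 is a hypothetical value, so substitute your own setting. The function name is made up for illustration.

```python
import math


def workers_needed(retained_containers: int, max_containers_per_worker: int) -> int:
    """Lower bound on the workers required just to hold retained containers.

    Assumption: every retained container occupies one slot until the
    failed-build grace period expires.
    """
    if max_containers_per_worker <= 0:
        raise ValueError("max_containers_per_worker must be positive")
    return math.ceil(retained_containers / max_containers_per_worker)


# 575 failed jobs (the pipeline size from the original post) against a
# hypothetical limit of 250 containers per worker.
print(workers_needed(575, 250))  # 3
```

This is only a floor: a build with several steps may retain more than one container, and the result leaves no headroom for new builds or resource checks.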
I think this is a bug, but I don't have a step-by-step repro for you at this time.
Problem
One of my colleagues runs a very wide pipeline in our Concourse cluster: there are at least 575 jobs in one pipeline. (That is not a typo. Almost six hundred.) Today, as luck would have it, all of these jobs failed on a common bug. Soon afterwards, the Concourse cluster refused to execute any builds.
No pipeline could build -- or even run resource checks -- in this state.
I suspect GC was deferred because these builds went red.
Separately, there is no way for the operator to manually delete a container -- see #346 -- so it is not so easy to recover from this failure mode. I ultimately bounced our workers as a work-around. (Our workers 'forget' their containers when we bounce them. I don't know whether that is a separate bug; I guess it turned out to be a 'feature' today!)
Suggestion
fly hijack is a cool feature. I understand why you hang on to these old containers. Nevertheless, I don't think it is reasonable to prioritise a debugging feature over foundational task scheduling. A cluster that cannot run new builds is a useless cluster.
Would it be possible to add container pre-emption to the ATC scheduler?
If, when starting a new container, all workers are found to be at or near their container limit, the ATC should immediately GC some old containers that are technically eligible for GC. (Any container that is not being used to actively execute stuff.) Prioritise builds. Sacrifice fly hijack if we are at the limit.
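To make the suggestion concrete, here is a minimal sketch of the selection logic I have in mind. It is not how the ATC works today. The Container type, its fields, and pick_containers_to_preempt are hypothetical names invented for illustration, and "reclaim the longest-idle container first" is just one possible policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    handle: str
    in_use: bool         # True while a build step or check is running in it
    idle_seconds: float  # time since the container last ran anything


def pick_containers_to_preempt(containers: List[Container],
                               max_containers: int,
                               needed_slots: int = 1) -> List[Container]:
    """Choose idle containers to reclaim so `needed_slots` new ones can start.

    Returns an empty list if the worker already has room. Containers that
    are actively in use are never selected, so fewer than the requested
    number may be returned when everything is busy.
    """
    free = max_containers - len(containers)
    shortfall = needed_slots - free
    if shortfall <= 0:
        return []
    idle = [c for c in containers if not c.in_use]
    # Assumed policy: reclaim the longest-idle containers first.
    idle.sort(key=lambda c: c.idle_seconds, reverse=True)
    return idle[:shortfall]


worker = [
    Container("build-1-failed", in_use=False, idle_seconds=7200),
    Container("build-2-failed", in_use=False, idle_seconds=600),
    Container("build-3-running", in_use=True, idle_seconds=0),
]
victims = pick_containers_to_preempt(worker, max_containers=3, needed_slots=1)
print([c.handle for c in victims])  # ['build-1-failed']
```

The point is that a full worker gives up a retained container from a failed build before it refuses new work, while anything currently executing is left alone.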