The agent autoscaling group should never rebalance availability zones
By default, if the two availability zones in the agent ASG become
significantly unbalanced, the ASG will terminate some instances in the
larger AZ and start some new ones in the smaller AZ.
That's helpful for an ASG serving web requests, but it's not very
helpful for our agent-shared workloads: the instance termination can
disrupt running jobs. With the new Lambda-based scaler, each instance is
responsible for terminating itself, so the AZs become unbalanced very
easily. That's not much of a problem though: the larger AZ is likely to
reduce in size relatively soon, and subsequent scale-outs will restore
the balance (for a while).
Sadly there's no way to suspend the AZRebalance process via
CloudFormation, so I held my nose and implemented it using a custom
resource. It's not as ugly as I feared, mainly because it's possible to
provide the required Lambda function inline.
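For illustration, a minimal sketch of what such an inline handler might look like in Python. This is not the code from this commit. It assumes the custom resource passes the group name in a property called `AutoScalingGroupName` (a hypothetical name), and that the function is deployed as inline code so the `cfnresponse` module is available. Resuming the process on delete is a choice made for this sketch. The core logic is kept separate from the Lambda entry point so it can be exercised without AWS.

```python
def apply_event(event, autoscaling):
    """Suspend or resume AZRebalance for the ASG named in the event.

    `autoscaling` is any object exposing boto3's Auto Scaling client
    methods suspend_processes and resume_processes.
    """
    # "AutoScalingGroupName" is a hypothetical property name for this sketch.
    group = event["ResourceProperties"]["AutoScalingGroupName"]
    params = {"AutoScalingGroupName": group, "ScalingProcesses": ["AZRebalance"]}
    if event["RequestType"] == "Delete":
        # Restore the default behaviour when the custom resource is removed.
        autoscaling.resume_processes(**params)
        return "resumed"
    # Create and Update both (re)apply the suspension.
    autoscaling.suspend_processes(**params)
    return "suspended"


def handler(event, context):
    """Lambda entry point: always report a result back to CloudFormation."""
    # Imported here so the module also loads outside Lambda; cfnresponse is
    # only provided for inline (ZipFile) function code.
    import boto3
    import cfnresponse

    try:
        action = apply_event(event, boto3.client("autoscaling"))
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {"Action": action})
    except Exception as exc:
        # Without a FAILED response the stack operation waits until it times out.
        cfnresponse.send(event, context, cfnresponse.FAILED, {"Error": str(exc)})
```

The function's role would also need permission for `autoscaling:SuspendProcesses` (and `autoscaling:ResumeProcesses` if the delete path is kept).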
An alternative approach would be to have our buildkite-agent-scaler
lambda check the AZRebalance status each time it loops and suspend the
process if required. I thought the custom resource approach might be
good enough for now, and we could try the scaler option down the track
if we need to.
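The scaler-side check described above could look roughly like the following. It is a Python sketch for illustration only, not code from buildkite-agent-scaler; the function name `ensure_az_rebalance_suspended` is hypothetical, and `autoscaling` is assumed to be a boto3 Auto Scaling client (or anything with the same two methods).

```python
def ensure_az_rebalance_suspended(autoscaling, group_name):
    """Suspend AZRebalance on the group unless it is already suspended.

    Returns True if a suspend call was made, False if nothing was needed.
    """
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = resp["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"auto scaling group not found: {group_name}")

    # Each suspended process is reported as a dict with a "ProcessName" key.
    suspended = {p["ProcessName"] for p in groups[0].get("SuspendedProcesses", [])}
    if "AZRebalance" in suspended:
        return False

    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=["AZRebalance"],
    )
    return True
```

Calling this once per scaler loop would cost one extra describe call per iteration, with a suspend call only when the process is found active.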
Some resources I found useful:
1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html