Commit 7f880f9

The agent autoscaling group should never rebalance availability zones
By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ. That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads: the instance termination can disrupt running jobs.

With the new Lambda-based scaler, each instance is responsible for terminating itself, and the AZs become unbalanced very easily. That's not much of a problem though: the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while).

Sadly there's no way to suspend the AZRebalance process via CloudFormation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required Lambda function inline.

An alternative approach would be to have our buildkite-agent-scaler Lambda check the AZRebalance status each time it loops and suspend the process if required. I thought the custom resource approach might be good enough for now, and we could try the scaler option down the track if we need to.

Some resources I found useful:

1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
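
For comparison, the alternative mentioned in the message (checking the AZRebalance status on each scaler loop and suspending only when needed) might look roughly like this. It is a sketch in Python to match the inline function in this commit, not code from buildkite-agent-scaler. The function name `ensure_az_rebalance_suspended` is hypothetical, and `client` is assumed to be a boto3 Auto Scaling client (`boto3.client('autoscaling')`), passed in so the logic can be tried against a stub.

```python
def ensure_az_rebalance_suspended(client, asg_name):
    """Suspend AZRebalance on the named ASG unless it is already suspended.

    Returns True if a suspend call was made, False if nothing needed doing.
    `client` is assumed to be a boto3 'autoscaling' client, or a stub that
    offers the same two methods.
    """
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = resp.get("AutoScalingGroups", [])
    if not groups:
        raise ValueError("no such auto scaling group: %s" % asg_name)

    # Each entry in SuspendedProcesses carries a ProcessName key.
    suspended = {p["ProcessName"] for p in groups[0].get("SuspendedProcesses", [])}
    if "AZRebalance" in suspended:
        return False

    client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["AZRebalance"],
    )
    return True
```

With this approach the scaler's role would need `autoscaling:DescribeAutoScalingGroups` in addition to `autoscaling:SuspendProcesses`.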
1 parent b65b7be commit 7f880f9

templates/aws-stack.yml

Lines changed: 54 additions & 0 deletions
@@ -939,6 +939,59 @@ Resources:
       AutoScalingReplacingUpdate:
         WillReplace: true
 
+  AsgProcessSuspenderRole:
+    Type: AWS::IAM::Role
+    Properties:
+      AssumeRolePolicyDocument:
+        Version: 2012-10-17
+        Statement:
+          - Action: ['sts:AssumeRole']
+            Effect: Allow
+            Principal:
+              Service: [lambda.amazonaws.com]
+      ManagedPolicyArns:
+        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
+      Policies:
+        - PolicyName: AsgProcessModification
+          PolicyDocument:
+            Version: 2012-10-17
+            Statement:
+              - Effect: Allow
+                Action:
+                  - 'autoscaling:SuspendProcesses'
+                Resource: '*'
+
+  AzRebalancingSuspenderFunction:
+    Type: AWS::Lambda::Function
+    Properties:
+      Description: 'Disables AZ Rebalancing on the agent ASG'
+      Code:
+        ZipFile: |
+          import cfnresponse
+          import boto3
+          def handler(event, context):
+              try:
+                  if event['RequestType'] == 'Delete':
+                      cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, "CustomResourcePhysicalID")
+                  else:
+                      client = boto3.client('autoscaling')
+                      props = event['ResourceProperties']
+                      response = client.suspend_processes(AutoScalingGroupName=props['AutoScalingGroupName'], ScalingProcesses=['AZRebalance'])
+                      cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, "CustomResourcePhysicalID")
+              except BaseException as err:
+                  print('ERROR: ', err)
+                  cfnresponse.send(event, context, cfnresponse.FAILED, {}, "CustomResourcePhysicalID")
+      Handler: index.handler
+      Role: !GetAtt AsgProcessSuspenderRole.Arn
+      Runtime: 'python3.7'
+
+  AzRebalancingSuspender:
+    Type: AWS::CloudFormation::CustomResource
+    Version: 1.0
+    Properties:
+      ServiceToken: !GetAtt AzRebalancingSuspenderFunction.Arn
+      AutoScalingGroupName: !Ref AgentAutoScaleGroup
+
   SecurityGroup:
     Type: AWS::EC2::SecurityGroup
     Condition: CreateSecurityGroup
@@ -1072,3 +1125,4 @@ Resources:
       Action: "lambda:InvokeFunction"
       Principal: "events.amazonaws.com"
       SourceArn: !GetAtt AutoscalingLambdaScheduledRule.Arn
+
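
The inline handler's control flow (do nothing on Delete, suspend AZRebalance on Create or Update, report FAILED on any error) can be tried locally without AWS by injecting its two side effects. The following is only a sketch under that assumption: `handle_request`, `suspend` and `respond` are hypothetical names that are not part of the template, and it catches `Exception` rather than `BaseException` as the inline code does.

```python
def handle_request(event, suspend, respond):
    """Roughly mirror the inline handler's flow with injected dependencies.

    suspend(asg_name) stands in for client.suspend_processes(...);
    respond(status) stands in for cfnresponse.send(...). Both are
    hypothetical seams for local testing, not part of the template.
    """
    try:
        # Delete requests need no action; Create and Update both suspend.
        if event["RequestType"] != "Delete":
            suspend(event["ResourceProperties"]["AutoScalingGroupName"])
        respond("SUCCESS")
    except Exception as err:
        # Always answer CloudFormation, otherwise the stack operation
        # waits until the custom resource times out.
        print("ERROR:", err)
        respond("FAILED")


if __name__ == "__main__":
    event = {
        "RequestType": "Create",
        "ResourceProperties": {"AutoScalingGroupName": "example-agent-asg"},
    }
    handle_request(event, suspend=lambda name: None, respond=print)
```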
