The agent autoscaling group should never rebalance availability zones

yob · yob · commit 7f880f938b3e · 2020-10-22T01:02:25.000+11:00
By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ.

That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads - the instance termination can disrupt running jobs.

With the new lambda-based scaler, each instance is responsible for terminating itself and the AZs become unbalanced very easily. That's not much of a problem though - the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while).

Sadly there's no way to suspend the AZRebalance process via CloudFormation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required lambda function inline.

An alternative approach would be to have our buildkite-agent-scaler lambda check the AZRebalance status each time it loops and suspend the process if required. I thought the custom resource approach might be good enough for now, and we could try the scaler option down the track if we need to.

Some resources I found useful:

1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
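
For reference, the alternative mentioned above (checking the AZRebalance status on each scaler loop and suspending it if required) could look roughly like the sketch below. It is written in Python to match the inline lambda in this commit and is not taken from buildkite-agent-scaler. The function name `ensure_az_rebalance_suspended` is hypothetical, and the client is passed in so the logic can be exercised without AWS credentials.

```python
def ensure_az_rebalance_suspended(client, asg_name):
    """Suspend AZRebalance on asg_name unless it is already suspended.

    `client` is expected to behave like boto3.client('autoscaling').
    Returns True if a suspend call was made, False if nothing was needed.
    """
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = resp.get("AutoScalingGroups", [])
    if not groups:
        raise ValueError("ASG not found: " + asg_name)

    # Each entry in SuspendedProcesses carries the name of a suspended process.
    suspended = {p["ProcessName"] for p in groups[0].get("SuspendedProcesses", [])}
    if "AZRebalance" in suspended:
        return False  # already suspended, so skip the extra API call

    client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["AZRebalance"],
    )
    return True
```

A scaler loop would call this once per iteration, for example `ensure_az_rebalance_suspended(boto3.client('autoscaling'), asg_name)`. Compared with the custom resource, this would also need the `autoscaling:DescribeAutoScalingGroups` permission, but it would re-suspend the process if someone resumed it by hand.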
diff --git a/templates/aws-stack.yml b/templates/aws-stack.yml
@@ -939,6 +939,59 @@ Resources:
       AutoScalingReplacingUpdate:
         WillReplace: true
 
+  AsgProcessSuspenderRole:
+    Type: AWS::IAM::Role
+    Properties:
+      AssumeRolePolicyDocument:
+        Version: 2012-10-17
+        Statement:
+        - Action: ['sts:AssumeRole']
+          Effect: Allow
+          Principal:
+            Service: [lambda.amazonaws.com]
+      ManagedPolicyArns:
+        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
+      Policies:
+      - PolicyName: AsgProcessModification
+        PolicyDocument:
+          Version: 2012-10-17
+          Statement:
+          - Effect: Allow
+            Action:
+            - 'autoscaling:SuspendProcesses'
+            Resource: '*'
+
+  AzRebalancingSuspenderFunction:
+    Type: AWS::Lambda::Function
+    Properties:
+      Description: 'Disables AZ Rebalancing on the agent ASG'
+      Code:
+        ZipFile: |
+          import cfnresponse
+          import boto3
+          def handler(event, context):
+            try:
+              if event['RequestType'] == 'Delete':
+                cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, "CustomResourcePhysicalID")
+              else:
+                client = boto3.client('autoscaling')
+                props = event['ResourceProperties']
+                response = client.suspend_processes(AutoScalingGroupName=props['AutoScalingGroupName'], ScalingProcesses=['AZRebalance'])
+                cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, "CustomResourcePhysicalID")
+            except BaseException as err:
+              print('ERROR: ', err)
+              cfnresponse.send(event, context, cfnresponse.FAILED, {}, "CustomResourcePhysicalID")
+      Handler: index.handler
+      Role: !GetAtt AsgProcessSuspenderRole.Arn
+      Runtime: 'python3.7'
+
+  AzRebalancingSuspender:
+    Type: AWS::CloudFormation::CustomResource
+    Version: 1.0
+    Properties:
+      ServiceToken: !GetAtt AzRebalancingSuspenderFunction.Arn
+      AutoScalingGroupName: !Ref AgentAutoScaleGroup
+
   SecurityGroup:
     Type: AWS::EC2::SecurityGroup
     Condition: CreateSecurityGroup
@@ -1072,3 +1125,4 @@ Resources:
       Action: "lambda:InvokeFunction"
       Principal: "events.amazonaws.com"
       SourceArn: !GetAtt AutoscalingLambdaScheduledRule.Arn
+