Apps cannot auto-scale until an `autoscaling` deploy has successfully completed #1342

Since https://github.com/guardian/riff-raff/pull/83 back in April 2013, Riff Raff [`autoscaling`](https://riffraff.gutools.co.uk/docs/magenta-lib/types#autoscaling) deploys have always disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy, once deployment has successfully completed:

https://github.com/guardian/riff-raff/blob/60eb09f08db8806a42e1df2e2d666fc1004a513d/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala#L170-L205

There are good reasons for this, but it leads to two problems:

- Even during a [successful](https://riffraff.gutools.co.uk/deployment/view/86cfcc20-9b8f-45e4-95ef-9f03f3c86889) deploy, there's **a window of ~3 minutes when the app cannot scale**
- More severely, if a deploy fails, the app is left with *ASG scaling alarms disabled* until developers manually re-enable them (or run another deploy). The response time for this can be much longer, even hours.

For apps that receive sudden, unpredictable bursts of traffic and are deployed many times a day, these windows add up: eventually a deploy is likely to coincide with a traffic spike that the app cannot scale up to handle.

## Ophan Tracker outage - 22nd May 2024
[_full incident summary_](https://docs.google.com/document/d/1AesO6g-uz-YWoMzXZIKvHpcD-RVFiv3RjpIgD7-EpIs/edit)

- 16:04 - Ophan PR [#6109](https://github.com/guardian/ophan/pull/6109), a minor change to the Ophan Dashboard, is merged. This will trigger a deploy of all Ophan apps, including the Ophan Tracker.
- 16:11 - App Notification for major news story [_Rishi Sunak will call general election for July this afternoon in surprise move, senior sources tell the Guardian_](https://dashboard.ophan.co.uk/info?capi-id=politics/article/2024/may/22/rishi-sunak-will-call-general-election-for-july-in-surprise-move-sources&referring-host=GuardianPush) is sent out: <img width="1150" alt="image" src="https://github.com/guardian/riff-raff/assets/52038/cf5a4c08-ed5e-493a-ab6b-4210b1a547bf">
- 16:12:02 - Riff Raff [deploy](https://riffraff.gutools.co.uk/deployment/view/c7a9de78-d7b5-4b36-9f04-854f198247ec) disables auto-scaling alarms, with the size of the ASG set to 3 instances
- 16:13:32 - Ophan Tracker's [scale-up alarm](https://eu-west-1.console.aws.amazon.com/cloudwatch/home?region=eu-west-1#alarmsV2:alarm/Ophan-Tracker-PROD-CPUHighAlarm8A23B76E-1NV4HX246S2H4?~(search~'Ophan-Tracker-PROD-CPUHighAlarm8A23B76E-1NV4HX246S2H4)) enters ALARM status. The Tracker ASG would normally scale up on 2 consecutive ALARM states 1 minute apart, but ASG scale-up has been disabled by the deploy.
![image](https://github.com/guardian/riff-raff/assets/52038/bb2b114a-c2ba-4df3-a036-249ef54e6bf3)


- 16:14:26 - Riff Raff deploy culls the 3 old instances, taking the ASG size back to 3 instances - the cluster is now very under-scaled for the spike in traffic
- 16:14:37 - Riff Raff deploy starts the final `WaitForStabilization`, which is the last step before re-enabling alarms. Due to the servers being so overloaded, they never stabilise. The step has a 15 minute timeout.
- 16:29:42 - The deploy finally fails as `WaitForStabilization` times out, and the alarms are left disabled.
- 17:19:30 - Tracker ASG is manually scaled up to 6 instances by the Ophan team
- 17:23:12 - Tracker ASG stops terminating unhealthy instances - the outage has lasted just over 1 hour
- 17:30:41 - Alarms are finally re-enabled by the Ophan team performing a new [deploy](https://riffraff.gutools.co.uk/deployment/view/5769e851-fda1-484c-a938-b68389d522f4)
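
For reference, the key intervals in the timeline above can be worked out with a few lines of Scala. This is only a sketch: the timestamps are copied from the list, the `OutageTimeline` object and its field names are made up for illustration, and treating the cull at 16:14:26 as the start of the outage is an assumption rather than something taken from the incident summary.

```scala
import java.time.{Duration, LocalTime}

object OutageTimeline {
  // Timestamps copied from the timeline above (all on 22nd May 2024).
  val alarmsSuspended: LocalTime     = LocalTime.parse("16:12:02")
  val oldInstancesCulled: LocalTime  = LocalTime.parse("16:14:26")
  val finalWaitStarted: LocalTime    = LocalTime.parse("16:14:37")
  val deployFailed: LocalTime        = LocalTime.parse("16:29:42")
  val terminationsStopped: LocalTime = LocalTime.parse("17:23:12")
  val alarmsResumed: LocalTime       = LocalTime.parse("17:30:41")

  // How long the ASG scaling alarms stayed disabled.
  val alarmsDisabledFor: Duration =
    Duration.between(alarmsSuspended, alarmsResumed)

  // How long the final WaitForStabilization ran before timing out.
  val finalWaitLasted: Duration =
    Duration.between(finalWaitStarted, deployFailed)

  // Assumption: the outage is measured from the cull of the old instances
  // until the ASG stopped terminating unhealthy instances.
  val cullToRecovery: Duration =
    Duration.between(oldInstancesCulled, terminationsStopped)

  def main(args: Array[String]): Unit = {
    println(s"Alarms disabled for:       ${alarmsDisabledFor.getSeconds} s")
    println(s"Final wait before timeout: ${finalWaitLasted.getSeconds} s")
    println(s"Cull to recovery:          ${cullToRecovery.getSeconds} s")
  }
}
```

On these figures the alarms were disabled for 1h 18m 39s, the final `WaitForStabilization` ran for 15m 05s (matching its 15 minute timeout), and the gap between the cull and recovery was 1h 08m 46s.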

In this case, had `ResumeAlarmNotifications` run immediately _before_ the final `WaitForStabilization`, the deploy would still have failed, but the outage would probably have ended within a minute or two of 16:14, giving an outage of about 2 minutes rather than over 1 hour.



	SuspendAlarmNotifications(autoScalingGroup, target.region),
	TagCurrentInstancesWithTerminationTag(autoScalingGroup, target.region),
	ProtectCurrentInstances(autoScalingGroup, target.region),
	DoubleSize(autoScalingGroup, target.region),
	HealthcheckGrace(
	  autoScalingGroup,
	  target.region,
	  healthcheckGrace(pkg, target, reporter) * 1000
	),
	WaitForStabilization(
	  autoScalingGroup,
	  secondsToWait(pkg, target, reporter) * 1000,
	  target.region
	),
	WarmupGrace(
	  autoScalingGroup,
	  target.region,
	  warmupGrace(pkg, target, reporter) * 1000
	),
	WaitForStabilization(
	  autoScalingGroup,
	  secondsToWait(pkg, target, reporter) * 1000,
	  target.region
	),
	CullInstancesWithTerminationTag(autoScalingGroup, target.region),
	TerminationGrace(
	  autoScalingGroup,
	  target.region,
	  terminationGrace(pkg, target, reporter) * 1000
	),
	WaitForStabilization(
	  autoScalingGroup,
	  secondsToWait(pkg, target, reporter) * 1000,
	  target.region
	),
	ResumeAlarmNotifications(autoScalingGroup, target.region)
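
As a rough illustration of the reordering suggested above, the sketch below models the task list with simplified stand-in types and moves `ResumeAlarmNotifications` to just before the final `WaitForStabilization`. The `Task` trait, the case objects and `resumeBeforeFinalWait` are hypothetical and only mirror the task names in the snippet; the real Riff Raff tasks take an ASG, a region and timeouts, and this is not the actual `AutoScaling.scala` code.

```scala
object ReorderedDeploySketch {
  // Hypothetical stand-ins for the Riff Raff task classes (arguments omitted).
  sealed trait Task
  case object SuspendAlarmNotifications extends Task
  case object TagCurrentInstancesWithTerminationTag extends Task
  case object ProtectCurrentInstances extends Task
  case object DoubleSize extends Task
  case object HealthcheckGrace extends Task
  case object WaitForStabilization extends Task
  case object WarmupGrace extends Task
  case object CullInstancesWithTerminationTag extends Task
  case object TerminationGrace extends Task
  case object ResumeAlarmNotifications extends Task

  // The order of tasks as it appears in the snippet above.
  val currentOrder: List[Task] = List(
    SuspendAlarmNotifications,
    TagCurrentInstancesWithTerminationTag,
    ProtectCurrentInstances,
    DoubleSize,
    HealthcheckGrace,
    WaitForStabilization,
    WarmupGrace,
    WaitForStabilization,
    CullInstancesWithTerminationTag,
    TerminationGrace,
    WaitForStabilization,
    ResumeAlarmNotifications
  )

  // Move ResumeAlarmNotifications to just before the last WaitForStabilization,
  // so alarms are active again even if that final wait times out.
  def resumeBeforeFinalWait(tasks: List[Task]): List[Task] = {
    val withoutResume = tasks.filterNot(_ == ResumeAlarmNotifications)
    val lastWait = withoutResume.lastIndexOf(WaitForStabilization)
    if (!tasks.contains(ResumeAlarmNotifications) || lastWait < 0) tasks
    else {
      val (before, fromLastWait) = withoutResume.splitAt(lastWait)
      before ++ (ResumeAlarmNotifications :: fromLastWait)
    }
  }

  def main(args: Array[String]): Unit =
    resumeBeforeFinalWait(currentOrder).foreach(println)
}
```

With this ordering the list ends with `TerminationGrace`, `ResumeAlarmNotifications`, `WaitForStabilization`, so a timeout in the last step would no longer leave the alarms disabled. Whether it is safe to have alarms active while the final stabilisation check runs is a separate question this sketch does not answer.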
