Commit d896185

Added blueprints area, plus changes to the cloud-resilience guidance
1 parent 00cdfa3 commit d896185

4 files changed: +33 -8 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ The framework is a companion to:
 
 The framework consists of:
 
-* [Engineering principles](principles.md)
+* [Engineering principles](principles.md) and [blueprints](blueprints.md)
 * [Engineering quality review tool](insights/review.md)
 * [Communities of practice guidelines](communities/communities-of-practice.md) and active communities:
 * [Product Development Test Automation Working Group](communities/pd-test-automation-working-group.md)

blueprints.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Engineering blueprints
+
+This is a list of blueprint solutions to common problems which are referenced within this quality framework:
+
+- [Cross-account backups on AWS](blueprints/backups-aws.md)

blueprints/backups-aws.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Cross-account backups on AWS
+
+## Context
+
+- These notes are part of a broader set of [blueprints](../blueprints.md)
+- This blueprint relates to [service reliability](../practices/service-reliability.md) and specifically to use of [cloud services](../practices/cloud-services.md)
+
+## TBC
+
+... TBC ...

practices/cloud-services.md

Lines changed: 17 additions & 7 deletions
@@ -21,18 +21,28 @@
 - Prefer serverless platform as a service (PaaS) over infrastructure as a service (IaaS) (see [outsource bottom up](../patterns/outsource-bottom-up.md)).
 - Where not serverless use ephemeral and immutable infrastructure.
 - Engage your cloud supplier early on in the development process. They have various tools and processes to help you (e.g. [AWS Well-Architected Review](https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&wa-lens-whitepapers.sort-order=desc)).
-- Understand cloud supplier SLAs.
-- Make systems self-healing.
-- Prefer technologies which are resilient by default.
-- Favour global-scoped (e.g. [CloudFront](https://aws.amazon.com/cloudfront/) or [Front Door](https://azure.microsoft.com/en-gb/pricing/details/frontdoor/)) or region-scoped services (e.g. [S3](https://aws.amazon.com/s3/), [Lambda](https://aws.amazon.com/lambda/), [Azure Functions](https://azure.microsoft.com/en-gb/services/functions/)) to availability-zone (AZ) scoped (e.g. [VMs](https://azure.microsoft.com/en-gb/services/virtual-machines/), [RDS DBs](https://aws.amazon.com/rds/)) or single-instance services (e.g. [EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)).
+- Make systems self-healing and resilient:
+- Be aware that terms such as "region" have different meanings across different cloud vendors
+- For example, it is not valid to compare the number of UK "regions" in AWS and Azure
+- High levels of resilience can be achieved using UK-based cloud services for providers such as AWS and Azure, if the full scope and resilience of the clouds is used
+- Cross-account and cross-region are also sometimes conflated in terms of resilience:
+- As a minimum, all systems should have a tamper-proof cross-account backup to protect against account compromise, e.g. a ransomware attack: see [blueprint for AWS-based systems](../blueprints/backups-aws.md)
+- You may wish to additionally consider cross-region backups to protect against region failure
+- Be aware of the resilience of any systems on which your system depends. For example, in a region-failure scenario, a standby for your system in a second region won't help if your system relies on another system which only runs in the single region which has failed
+- Be aware of the difference between the resilience of the cloud and your system's resilience in the cloud
+- Understand the SLAs of the cloud services you use.
+- Every cloud service you use introduces more dependencies and more opportunities for service issues ...
+- ... but bespoke engineering to avoid using cloud vendor services introduces additional complexity and opportunities for reliability issues
+- ... and the risks are typically far greater for bespoke engineering, therefore: favour cloud services over bespoke engineering
+- Prefer technologies which are resilient by default: favour global-scoped (e.g. [CloudFront](https://aws.amazon.com/cloudfront/) or [Front Door](https://azure.microsoft.com/en-gb/pricing/details/frontdoor/)) or region-scoped services (e.g. [S3](https://aws.amazon.com/s3/), [Lambda](https://aws.amazon.com/lambda/), [Azure Functions](https://azure.microsoft.com/en-gb/services/functions/)) to availability-zone (AZ) scoped (e.g. [VMs](https://azure.microsoft.com/en-gb/services/virtual-machines/), [RDS DBs](https://aws.amazon.com/rds/)) or single-instance services (e.g. [EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)).
 - For AZ-scoped services, use redundancy to create required resilience (e.g. [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html) or [Azure Scale/Availability Sets](https://docs.microsoft.com/en-us/azure/virtual-machines/availability)), and:
 - For stateless components use active-active configurations across AZs (e.g. running stateless containers across multiple AZs using [AWS Elastic Kubernetes Service](https://aws.amazon.com/eks/))
 - For stateful components, e.g. databases, consider use of active-active configurations across AZs (e.g. [Aurora Multi-Master](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html)), but be aware of the added complexity that conflict resolution for asynchronous replication can bring, and of the potential performance impact where synchronous replication is chosen.
 - Consider use of multiple regions (e.g. for AWS eu-west-1 [Dublin] as well as eu-west-2 [London]) as a way to improve availability, though ensure data sovereignty implications are understood and accepted (see below).
 - Understand failover (e.g. [RDS failover](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html#:~:text=Failover%20times%20are%20typically%2060%E2%80%93120%20seconds.)) and failed instance replacement times and engineer to accommodate these.
-- Be aware of data sovereignty implications of using any systems hosted outside the UK.
-- Make sure your information governance lead is aware and included in decision making.
-- Consider SaaS tools the team uses as well as the systems we build.
+- Be aware of data sovereignty implications of using any systems hosted outside the UK.
+- Make sure your information governance lead is aware and included in decision making.
+- Consider SaaS tools the team uses as well as the systems we build.
 - Services should scale automatically up and down.
 - If possible, drive scaling based on metrics which matter to users (e.g. response time), but balance this with the benefits of choosing leading indicators (e.g. CPU usage) to avoid slow scaling from impacting user experience.
 - Understand how rapidly demand can spike and ensure scaling can meet these requirements. Balance scaling needs with the desire to avoid over provisioning and use [pre-warming](https://petrutandrei.wordpress.com/2016/03/18/pre-warming-the-load-balancer-in-aws/) judiciously where required. Discuss this with the cloud provider well before go-live, as they can assist with pre-warming processes ([AWS](https://aws.amazon.com/premiumsupport/programs/iem/)).
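
The guidance above on understanding SLAs, dependencies and redundancy across AZs can be made concrete with some simple arithmetic. The following is a minimal sketch, not part of the commit: the function names are hypothetical, the availability figures are illustrative rather than real vendor SLAs, and it assumes failures are independent, which real outages often are not.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a system that needs every listed dependency to be up.

    Assumes independent failures, so the individual figures multiply.
    """
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent replicas where any one suffices."""
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected downtime per (365-day) year for a given availability."""
    return (1 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Illustrative figures only: three dependencies, each at 99.9%.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(f"three services in series: {chain:.6f}")

    # One component at 99% duplicated across two AZs (assumed independent).
    pair = redundant_availability(0.99, 2)
    print(f"two redundant copies: {pair:.6f}")

    print(f"downtime at 99.9%: {downtime_minutes_per_year(0.999):.1f} min/year")
```

Under these assumptions, every extra dependency in series lowers the overall figure, while redundancy raises it. This is also why a standby in a second region does not help if a shared dependency runs only in the failed region: that dependency stays in the series chain.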
