
Commit ec40afa: update code formatting
1 parent 214eaf9


content/blog/argo-rollout-aws-alb.md

Lines changed: 56 additions & 69 deletions
@@ -9,9 +9,9 @@ weight: 1

The blog discusses resolving a deployment issue with 502 errors on AWS EKS using AWS ALB and Argo Rollouts. It details the root cause, attempted solutions, and resulting trade-offs.

## Context

We encountered a 502 error at the AWS load balancer level during the deployment of one of our client's services. This caused 1-2 minutes of downtime, which was unexpected even though the deployment was carried out during the day, at peak traffic. Nearly 15,000 requests per minute were made to this high-throughput service, and over 10 percent of them resulted in 502 errors. The new version of the service was supposed to roll out without disrupting individual user requests, even during peak hours.

## The Deployment Strategy

@@ -34,49 +34,41 @@ The way the ALB manages traffic is the main source of the issue. Traffic is dire

1. Kubernetes uses probes to identify unhealthy pods and remove them from the list of service endpoints, so the probes are required. We double-checked the probe intervals so that the kubelet communicates properly with kube-proxy and the endpoint is removed in the event of a failure (a minimal probe sketch follows).
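
For illustration, a readiness probe along these lines is what keeps an unhealthy pod out of the service endpoints. This is only a minimal sketch; the container name, health path, port, and thresholds are assumptions rather than the service's actual values.

```yaml
spec:
  containers:
    - name: app               # hypothetical container name
      readinessProbe:
        httpGet:
          path: /healthz      # assumed health check endpoint
          port: 8080          # assumed application port
        periodSeconds: 5      # assumed probe interval (not the actual value used)
        failureThreshold: 2   # pod leaves the endpoints after two consecutive failures
```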

2. To give kube-proxy enough time to remove the endpoint from the service endpoint list, we must add an extra delay. We added a pre-stop hook that sleeps for 60 seconds at the container level before pod termination; only after it completes does the system send the SIGTERM signal to begin the shutdown.

```yaml
spec:
  containers:
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - '-c'
            - sleep 60
```

3. We have configured a graceful shutdown window of 90 seconds so that the application can finish processing in-flight requests. Upon receiving the SIGTERM signal, the application stops accepting new requests and takes some time to drain the existing ones. To make this work, the application itself also needs code that reacts to SIGTERM and shuts down cleanly before Kubernetes sends SIGKILL to end the process entirely.

```yaml
spec:
  terminationGracePeriodSeconds: 90
```
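
Note that the pre-stop sleep and the grace period interact: the 90 seconds must cover both the 60-second sleep and whatever time the application needs to drain requests after SIGTERM. A minimal combined sketch, with a hypothetical container name:

```yaml
spec:
  terminationGracePeriodSeconds: 90      # must cover the preStop sleep plus drain time
  containers:
    - name: app                          # hypothetical container name
      lifecycle:
        preStop:
          exec:
            # 60s for endpoint removal, leaving roughly 30s to drain after SIGTERM
            command: ["/bin/sh", "-c", "sleep 60"]
```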

4. The configurations above help us manage in-flight requests on older pods from the Kubernetes side, but we also need to consider the AWS ALB side, since all traffic flows through the ALB and must be routed away from pods that are in the terminating state.

5. We reduced the health check interval from 15 seconds to 10 seconds to check the targets' health more aggressively. This way, the load balancer stops sending traffic to pods in the terminating state sooner.

```yaml
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
```

6. Additionally, we want the ALB to wait a suitable amount of time before removing targets that Kubernetes has marked as unhealthy or terminating. To account for this, we have set a 30-second deregistration delay.

```yaml
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
```
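
Putting the ALB-side settings together, both annotations live in the metadata of the Ingress managed by the AWS Load Balancer Controller. The sketch below is illustrative; the Ingress name is hypothetical and only the metadata block is shown.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: client-service    # hypothetical Ingress name
  annotations:
    # check target health every 10 seconds
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
    # give in-flight requests up to 30 seconds to finish before a target is fully deregistered
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
```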

## Did These Solutions Prove Beneficial?
@@ -89,39 +81,35 @@ Interestingly, during our careful observation of the canary deployment steps, we

### How Was This Canary Step Defined?

```yaml
canary:
  steps:
    - setCanaryScale:
        weight: 20
    - setWeight: 0
    - pause: { duration: 60 }
    - setCanaryScale:
        matchTrafficWeight: true
    - setWeight: 10
    - pause: { duration: 60 }
    - setWeight: 60
    - pause: { duration: 60 }
    - setWeight: 80
    - pause: { duration: 60 }
    - setWeight: 90
    - pause: { duration: 60 }
    - setWeight: 100
    - pause: { duration: 60 }
```

The Argo Rollouts canary configuration above looks normal, but the problem only surfaced during the final phase of shifting traffic from 90% to 100%. To verify this, we slightly adjusted the final steps as shown below, which helped bring the share of 502 responses down from 10% to 1%.

```yaml
- setWeight: 90
- pause: { duration: 60 }
- setWeight: 99
- pause: { duration: 60 }
- setWeight: 100
- pause: { duration: 60 }
```

## Root Cause
@@ -133,12 +121,11 @@ We discovered that our use of **dynamicStableScale** caused older replica sets t

To address this, we disabled **dynamicStableScale** and increased **scaleDownDelaySeconds** from the default of 30 seconds to 60 seconds, so the rollout waits 60 seconds before scaling down the pods of the older ReplicaSet.

```yaml
spec:
  strategy:
    canary:
      dynamicStableScale: false
      scaleDownDelaySeconds: 60
```
