This blog post discusses how we resolved a deployment issue that caused 502 errors on AWS EKS with an AWS ALB and Argo Rollouts. It details the root cause, the solutions we attempted, and the resulting trade-offs.
## Context
We encountered 502 errors at the AWS load balancer level during the deployment of one of our client's services. This resulted in 1–2 minutes of downtime, which was unexpected even though the deployment was carried out during the day, at peak traffic. The service is high-throughput, receiving nearly 15,000 requests per minute, and over 10 percent of those requests returned 502 errors. Even during peak hours, a new version of the service was supposed to roll out without interfering with users' requests.
## The Deployment Strategy
The way the ALB manages traffic is the main source of the issue.
1. As far as we know, Kubernetes uses probes to identify unhealthy pods and removes them from the list of Service endpoints, so the probes are required. We confirmed the probe check intervals once more so that the kubelet reports a failure promptly and kube-proxy removes the endpoint when one occurs.
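   For illustration, a readiness probe along these lines is what drives that endpoint removal. This is only a sketch: the container name, health path, port, and timings below are assumptions, not the service's actual values.

   ```yaml
   spec:
     containers:
       - name: app                # hypothetical container name
         readinessProbe:
           httpGet:
             path: /healthz       # assumed health endpoint
             port: 8080           # assumed container port
           periodSeconds: 5       # check often so failures are detected quickly
           failureThreshold: 2    # after two failures the pod is marked NotReady and its endpoint is removed
   ```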
2. To give kube-proxy enough time to remove the endpoint from the Service's endpoint list, we added an extra delay: a preStop hook that sleeps for 60 seconds at the container level before pod termination, as shown below. Only after this does the system send SIGTERM to begin the shutdown.
```yaml
spec:
  containers:
    - lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - '-c'
              - sleep 60
```
3. We configured a graceful shutdown window of 90 seconds so that the application can finish processing existing requests. Upon receiving SIGTERM, the application stops accepting new requests and takes some time to drain the in-flight ones; the application code itself also has to handle SIGTERM and shut down cleanly before Kubernetes sends SIGKILL to end it entirely. Note that the grace period includes the 60-second preStop sleep, so the application effectively has about 30 seconds after SIGTERM.
```yaml
spec:
  terminationGracePeriodSeconds: 90
```
4. The configurations above are the most important levers for managing requests on the older pods from the Kubernetes side, but we also need to consider the AWS ALB side, since all traffic is routed through the ALB and must stop being routed to pods that are in the terminating state.
5. We reduced the health check interval from 15 seconds to 10 seconds to check the targets' health more aggressively. This way, the load balancer stops sending traffic to pods in the terminating state sooner.
6. Additionally, we want the ALB to wait a suitable amount of time before removing targets that Kubernetes has marked as unhealthy or terminated. To account for this, we set a 30-second deregistration delay (see the annotation sketch after this list).
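With the AWS Load Balancer Controller, these two ALB-side settings are usually expressed as Ingress annotations roughly like the following. The Ingress name is a placeholder and the rest of the resource is omitted; treat this as a sketch rather than our exact manifest.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: client-service        # hypothetical Ingress name
  annotations:
    # check target health every 10 seconds (previously 15)
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10'
    # wait 30 seconds before fully deregistering a draining target
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
```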
Interestingly, during our careful observation of the canary deployment steps, we noticed that the errors surfaced at one specific step.
### How Was This Canary Step Defined?
```yaml
canary:
  steps:
    - setCanaryScale:
        weight: 20
    - setWeight: 0
    - pause: { duration: 60 }
    - setCanaryScale:
        matchTrafficWeight: true
    - setWeight: 10
    - pause: { duration: 60 }
    - setWeight: 60
    - pause: { duration: 60 }
    - setWeight: 80
    - pause: { duration: 60 }
    - setWeight: 90
    - pause: { duration: 60 }
    - setWeight: 100
    - pause: { duration: 60 }
```
The Argo Rollouts canary configuration above looks normal, but the problem only surfaced during the final phase, when traffic was increased from 90% to 100%. To confirm this, we slightly adjusted the final steps as shown below, which helped bring the share of failed requests down from over 10% to about 1%.
```yaml
    - setWeight: 90
    - pause: { duration: 60 }
    - setWeight: 99
    - pause: { duration: 60 }
    - setWeight: 100
    - pause: { duration: 60 }
```
## Root Cause
We discovered that our use of **dynamicStableScale** caused the older replica sets to be scaled down while the ALB was still routing traffic to them.
To address this, we disabled **dynamicStableScale** and increased **scaleDownDelaySeconds** from the default of 30 seconds to 60 seconds, so the rollout waits 60 seconds before scaling down the older ReplicaSet's pods.
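In the Rollout spec, that change amounts to something like the following fragment of the canary strategy. This is a sketch: the steps and other fields shown earlier are unchanged and omitted here.

```yaml
strategy:
  canary:
    # keep the stable ReplicaSet at full scale while the canary ramps up,
    # instead of shrinking it in step with the traffic weight
    dynamicStableScale: false
    # wait 60 seconds after traffic has fully shifted before scaling down
    # the old ReplicaSet (the default is 30 seconds)
    scaleDownDelaySeconds: 60
```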