Commit a9cb910

Add longevity test results for 1.2.0 (#1737)

Problem: Ensure 1.2.0 NGF runs without errors over a long period of time and doesn't introduce any memory leaks.

Solution: Run longevity tests for 1.2.0 and share the results.

Closes #1647

# Results

Note: product telemetry feature was enabled but sending of product telemetry data was disabled.

## NGINX OSS

### Test environment

NGINX Plus: false

GKE Cluster:

- Node count: 3
- k8s version: v1.27.8-gke.1067004
- vCPUs per node: 2
- RAM per node: 4022900Ki
- Max pods per node: 110
- Zone: us-central1-c
- Instance Type: e2-medium

NGF pod name -- ngf-longevity-nginx-gateway-fabric-7f596f74c5-xzkzb

### Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   183.83ms  143.49ms   2.00s    79.14%
    Req/Sec   303.07    204.23     2.22k    66.90%
  204934013 requests in 5760.00m, 71.25GB read
  Socket errors: connect 0, read 344459, write 0, timeout 5764
Requests/sec:    592.98
Transfer/sec:    216.19KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   175.23ms  122.10ms   2.00s    68.72%
    Req/Sec   301.92    203.60     1.95k    66.97%
  204120642 requests in 5760.00m, 69.83GB read
  Socket errors: connect 0, read 337203, write 0, timeout 246
Requests/sec:    590.63
Transfer/sec:    211.87KB
```
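
As a quick consistency check on the wrk summaries above (this check is ours, not part of the test tooling), the reported Requests/sec should equal the total requests divided by the 5760-minute (4-day) duration:

```go
package main

import "fmt"

func main() {
	// Duration and totals are taken from the wrk output above.
	const durationSeconds = 5760 * 60 // 5760 minutes = 4 days

	runs := []struct {
		name     string
		requests float64
		reported float64 // Requests/sec as printed by wrk
	}{
		{"HTTP /coffee", 204934013, 592.98},
		{"HTTPS /tea", 204120642, 590.63},
	}

	for _, r := range runs {
		computed := r.requests / durationSeconds
		fmt.Printf("%-13s computed %.2f req/s, wrk reported %.2f req/s\n", r.name, computed, r.reported)
	}
}
```

Both computed rates match the reported values, so the request totals and rates of the two runs are internally consistent.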

### Logs

No error logs in nginx-gateway.

No error logs in nginx.

### Key Metrics

#### Containers memory

![oss-memory.png](oss-memory.png)

Drop in NGINX memory usage corresponds to the end of traffic generation.

#### NGF Container Memory

![oss-ngf-memory.png](oss-ngf-memory.png)

#### Containers CPU

![oss-cpu.png](oss-cpu.png)

Drop in NGINX CPU usage corresponds to the end of traffic generation.

### NGINX metrics

![oss-stub-status.png](oss-stub-status.png)

Drop in requests corresponds to the end of traffic generation.

### Reloads

Rate of reloads - successful and errors:

![oss-reloads.png](oss-reloads.png)

Reload spikes correspond to the 1-hour periods of backend re-rollouts.
However, the small spikes correspond to the periodic reconciliation of Secrets, which (incorrectly)
triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112

No reloads finished with an error.

Reload time distribution - counts:

![oss-reload-time.png](oss-reload-time.png)

Reload related metrics at the end:

![oss-final-reloads.png](oss-final-reloads.png)

All successful reloads took less than 5 seconds, with most under 1 second.

## NGINX Plus

### Test environment

NGINX Plus: true

GKE Cluster:

- Node count: 3
- k8s version: v1.27.8-gke.1067004
- vCPUs per node: 2
- RAM per node: 4022900Ki
- Max pods per node: 110
- Zone: us-central1-c
- Instance Type: e2-medium

NGF pod name -- ngf-longevity-nginx-gateway-fabric-fc7f6bcf-cnlww

### Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.03ms  120.83ms   2.00s    68.41%
    Req/Sec   313.29    209.75     2.11k    65.95%
  211857930 requests in 5760.00m, 74.04GB read
  Socket errors: connect 0, read 307, write 0, timeout 118
  Non-2xx or 3xx responses: 6
Requests/sec:    613.01
Transfer/sec:    224.63KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.25ms  120.87ms   2.00s    68.37%
    Req/Sec   312.62    209.06     1.95k    66.02%
  211427067 requests in 5760.00m, 72.76GB read
  Socket errors: connect 0, read 284, write 0, timeout 92
  Non-2xx or 3xx responses: 4
Requests/sec:    611.77
Transfer/sec:    220.77KB
```

Note: the non-2xx or 3xx responses correspond to the errors in the NGINX log; see below.

### Logs

nginx-gateway:

A lot of expected "usage reporting not enabled" errors, and the following:

```text
INFO 2024-03-20T14:13:00.372305088Z [resource.labels.containerName: nginx-gateway] {"level":"info", "msg":"Wait completed, proceeding to shutdown the manager", "ts":"2024-03-20T14:13:00Z"}
ERROR 2024-03-20T14:13:00.374159128Z [resource.labels.containerName: nginx-gateway] {"error":"leader election lost", "level":"error", "msg":"error received after stop sequence was engaged", "stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1 sigs.k8s.io/[email protected]/pkg/manager/internal.go:490", "ts":"2024-03-20T14:13:00Z"}
```

The error occurred during shutdown. Further investigation is needed to determine whether the shutdown process should be fixed:
https://github.com/nginxinc/nginx-gateway-fabric/issues/1735

nginx:

```text
ERROR 2024-03-17T21:11:11.017601264Z [resource.labels.containerName: nginx] 2024/03/17 21:11:10 [error] 43#43: *211045372 no live upstreams while connecting to upstream, client: 10.128.0.19, server: cafe.example.com, request: "GET /tea HTTP/1.1", upstream: "http://longevity_tea_80/tea", host: "cafe.example.com"
```

10 errors like this occurred at different times, all while backend pods were being updated. It is not clear why this happens.
Because the number of errors is small compared with the total number of handled requests (211857930 + 211427067), there is
no need to investigate further unless we see it again in the future at a larger volume.

### Key Metrics

#### Containers memory

![plus-memory.png](plus-memory.png)

Drop in NGINX memory usage corresponds to the end of traffic generation.

#### NGF Container Memory

![plus-ngf-memory.png](plus-ngf-memory.png)

#### Containers CPU

![plus-cpu.png](plus-cpu.png)

Drop in NGINX CPU usage corresponds to the end of traffic generation.

### NGINX Plus metrics

![plus-status.png](plus-status.png)

Drop in requests corresponds to the end of traffic generation.

### Reloads

Rate of reloads - successful and errors:

![plus-reloads.png](plus-reloads.png)

Note: compared to NGINX OSS, there are not as many reloads here, because NGF uses the NGINX Plus API to reconfigure NGINX
for endpoint changes.

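To illustrate what "uses the NGINX Plus API" means in practice, the sketch below adds a server to an upstream over the Plus REST API, which applies the endpoint change without an NGINX reload. This is a hypothetical, hand-written example, not NGF's actual client code: the API address, version path (`/api/9`), and server address are assumptions; only the upstream name `longevity_tea_80` is taken from the NGINX log above.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Assumed values for illustration: the Plus API address and version path.
	// The upstream name appears in the NGINX error log shown earlier.
	const url = "http://127.0.0.1:8080/api/9/http/upstreams/longevity_tea_80/servers"

	// POSTing a server object registers a new endpoint without reloading NGINX.
	body := bytes.NewBufferString(`{"server": "10.0.0.5:8080"}`)
	resp, err := http.Post(url, "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("NGINX Plus API response:", resp.Status)
}
```
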
The small spikes that do occur correspond to the periodic reconciliation of Secrets, which (incorrectly)
triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112

No reloads finished with an error.

Reload time distribution - counts:

![plus-reload-time.png](plus-reload-time.png)

Reload related metrics at the end:

![plus-final-reloads.png](plus-final-reloads.png)

All successful reloads took less than 1 second, with most under 0.5 seconds.

## Comparison with previous runs

Compared with the 1.1.0 results, NGF container memory usage appears to be 2 times higher.
That is probably due to a bug in the metric visualization for the 1.1.0 results (using mean instead of sum for aggregation).
Running 1.1.0 in a similar cluster yields only slightly lower memory usage (-2..-1 MB).

![oss-1.1.0-memory.png](oss-1.1.0-memory.png)
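
For context on the suspected visualization bug, the snippet below (with made-up numbers, not data from either run) shows how aggregating two memory series that belong to the same workload with mean instead of sum reports roughly half of the real total:

```go
package main

import "fmt"

func main() {
	// Made-up per-series memory samples (MiB) belonging to one workload.
	series := []float64{14.0, 16.0}

	var sum float64
	for _, v := range series {
		sum += v
	}
	mean := sum / float64(len(series))

	fmt.Printf("sum aggregation:  %.1f MiB\n", sum)  // 30.0 MiB (the real total)
	fmt.Printf("mean aggregation: %.1f MiB\n", mean) // 15.0 MiB (appears 2x lower)
}
```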
