@@ -181,17 +181,6 @@ any reported failures would be as likely to be the result of a bug in
the checking code as they would be to indicate a bug in the actual
syncing code.

- ```
- <<[UNRESOLVED]>>
-
- We need to decide whether we are going to try to implement such a
- check and associated metric. (For the initial Alpha release, we will
- not; if we decide to add it later that may postpone the transition to
- Beta.)
-
- <<[/UNRESOLVED]>>
- ```
-
## Design Details

### Test Plan
@@ -244,18 +233,10 @@ appropriate code for that.

#### Beta

- - If we decide that we need to implement the "partial sync gives
-   different result than full sync" metric, then we will not go to Beta
-   until at least one release has passed with that metric available in
-   Alpha.
-
- No new bugs
- Feature actually improves performance (e.g., in the `gce-5000Nodes`
  periodic job).
- - Real-world users trying out the Alpha feature have reported both
-   (a) no new bugs, and (b) improved performance.
- - General e2e tests modified to check the "partial sync failures"
-   metric at some point or points during the test run, TBD.
+ - Additional metrics to distinguish partial and full resync times

#### GA

@@ -280,7 +261,7 @@ ways.

###### How can this feature be enabled / disabled in a live cluster?

- - [ ] Feature gate (also fill in values in `kep.yaml`)
+ - [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `MinimizeIPTablesRestore`
  - Components depending on the feature gate:
    - kube-proxy
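
For illustration only (not part of the KEP text): a minimal sketch of enabling the gate through the kube-proxy configuration file, assuming kube-proxy is started with `--config`. The equivalent `--feature-gates=MinimizeIPTablesRestore=true` command-line flag works as well.

```yaml
# Sketch of a kube-proxy configuration fragment that enables the
# MinimizeIPTablesRestore feature gate; all other fields are omitted.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
featureGates:
  MinimizeIPTablesRestore: true
```
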
@@ -308,19 +289,23 @@ Nothing different than enabling it the first time.

No, but:

- - even with the new code enabled, kube-proxy will always do a full
+ - Even with the new code enabled, kube-proxy will always do a full
  sync the first time it syncs after startup (regardless of the
  pre-existing iptables state), so a test that switched from
  "disabled" to "enabled" would not really test anything different
  than a test that just brought up the cluster with the feature
  enabled in the first place.

- - even with the new code enabled, kube-proxy will periodically do a
+ - Even with the new code enabled, kube-proxy will periodically do a
  full resync, so a test that switched from "enabled" to "disabled"
  would not really test anything that wouldn't also be tested by just
  letting kube-proxy run for a while with the feature enabled. (Where
  "a while" is much less than the length of an e2e run.)

+ - The new code does not make any API writes (or change any other
+   persistent cluster or node state besides the iptables rules), so
+   there is nothing else that would need to be validated.
+
### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?
@@ -344,17 +329,19 @@ rules.

###### What specific metrics should inform a rollback?

- We should add metrics to catch the failure modes noted above in [Risks
- and Mitigations](#risks-and-mitigations); in particular we should make
- it noticeable when failed partial resyncs occur.
+ The new kube-proxy
+ `sync_proxy_rules_iptables_partial_restore_failures_total` metric
+ indicates when kube-proxy has had to fall back from a partial resync
+ to a full resync due to an unexpected `iptables-restore` error. It may
+ occasionally become non-zero due to unrelated `iptables-restore`
+ errors, but it should never increase steadily; if it does, that would
+ likely indicate a bug in the code.
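
To make this concrete, here is a sketch of a Prometheus alerting rule an operator could use to watch that metric. The rule name, window, and durations are illustrative assumptions, not something specified by the KEP.

```yaml
# Illustrative Prometheus alerting rule: fire if the partial-restore
# failure counter keeps increasing over a sustained period.
groups:
  - name: kube-proxy-minimize-iptables-restore
    rules:
      - alert: KubeProxyPartialRestoreFailuresIncreasing
        # A steadily increasing counter suggests a bug; occasional
        # one-off increments from unrelated iptables-restore errors
        # are expected and should not page anyone.
        expr: rate(sync_proxy_rules_iptables_partial_restore_failures_total[30m]) > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: kube-proxy is repeatedly falling back from partial to full iptables resyncs
```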

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

- <!--
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
- -->
+ No, but as described above, the ordinary operation of kube-proxy with
+ the feature enabled includes behavior that is equivalent to what would
+ happen during an upgrade or rollback.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -376,33 +363,71 @@ is enabled, then it is in use.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

- TBD
-
Sync performance should be much better in large clusters, and should
- be somewhat independent of cluster size. We will have better answers
- after it sees more real-world(-ish) use.
-
- `sync_proxy_rules_duration_seconds`
- `network_programming_duration_seconds`
- `sync_proxy_rules_endpoint_changes_pending`
- `sync_proxy_rules_service_changes_pending`
- `sync_proxy_rules_last_queued_timestamp_seconds` vs `sync_proxy_rules_last_timestamp_seconds`
-
+ be somewhat independent of cluster size.
+
+ ![graph of performance over time](5000Node-perf.png)
+
+ Enabling this feature on the 5000-node scalability test immediately
+ resulted in a 2x improvement to the
+ `network_programming_duration_seconds` (aka
+ `NetworkProgrammingLatency`) metric in the 50th, 90th, and 99th
+ percentile buckets, with the 50th percentile going from about 27
+ seconds to about 13 (the first drop in the graph above).
+
+ On further investigation, we realized that it could theoretically have
+ improved further, but the tests were running kube-proxy with
+ `--iptables-min-sync-period 10s`, which introduced its own latency.
+ Dropping that to `--iptables-min-sync-period 1s` (the default) caused
+ a further drop in the 50th percentile latency, to about 5 seconds (the
+ second drop in the graph above), though at the cost of increased CPU
+ usage (since `syncProxyRules()`, the heart of kube-proxy, now runs 10
+ times as often). (The 90th and 99th percentile latencies did not
+ change, because they are dominated more by the time it takes to do
+ _full_ resyncs. We realized that we should probably add separate
+ partial/full sync time metrics to get more fine-grained information.)
+
+ For 1.26 we added a new section to the kube-proxy documentation
+ describing the kube-proxy `--sync-period` and
+ `--iptables-min-sync-period` options ([website #28435]), which also
+ mentions that the `MinimizeIPTablesRestore` feature reduces the need
+ for `--iptables-min-sync-period` and that administrators using that
+ feature should try out smaller min-sync-period values. For 1.27, we
+ should add a note about the latency vs. CPU usage tradeoff, and we
+ should make sure to have a release note about this as well.
+
+ [website #28435]: https://github.com/kubernetes/website/pull/38435
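
As an illustration of how an operator might track this SLO (not something the KEP itself specifies), the percentiles discussed above can be derived from the existing `network_programming_duration_seconds` histogram, for example with Prometheus recording rules like the following sketch; the rule names and rate window are assumptions.

```yaml
# Illustrative Prometheus recording rules for tracking network
# programming latency percentiles from kube-proxy's histogram metric.
groups:
  - name: kube-proxy-network-programming-latency
    rules:
      - record: kubeproxy:network_programming_duration_seconds:p50
        expr: histogram_quantile(0.50, sum(rate(network_programming_duration_seconds_bucket[5m])) by (le))
      - record: kubeproxy:network_programming_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(network_programming_duration_seconds_bucket[5m])) by (le))
```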

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [X] Metrics
- - Metric name: sync_proxy_rules_iptables_partial_restore_failures_total
- - [Optional] Aggregation method: ??? (I don't know what this means)
+ - Metric names:
+   - sync_proxy_rules_iptables_partial_restore_failures_total
+   - sync_full_proxy_rules_duration_seconds
+   - sync_partial_proxy_rules_duration_seconds
  - Components exposing the metric:
    - kube-proxy

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- <!--
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
- -->
+ As of 1.26, there are no metrics for distinguishing the average
+ partial sync time from the average full sync time, which would help
+ administrators to understand kube-proxy's performance better and
+ figure out a good `--iptables-min-sync-period` value. We will add
+ those metrics for beta.
+
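
For example (an illustrative sketch, not part of the KEP): once the partial/full sync duration metrics listed above exist, the average durations could be computed from their `_sum` and `_count` series. This assumes the proposed metrics are exposed as histograms, like the existing `sync_proxy_rules_duration_seconds`; the rule names and window are made up.

```yaml
# Illustrative sketch: average partial vs. full resync durations,
# assuming the proposed metrics are histograms (so that _sum and
# _count series exist).
groups:
  - name: kube-proxy-resync-durations
    rules:
      - record: kubeproxy:sync_partial_proxy_rules_duration_seconds:avg10m
        expr: rate(sync_partial_proxy_rules_duration_seconds_sum[10m]) / rate(sync_partial_proxy_rules_duration_seconds_count[10m])
      - record: kubeproxy:sync_full_proxy_rules_duration_seconds:avg10m
        expr: rate(sync_full_proxy_rules_duration_seconds_sum[10m]) / rate(sync_full_proxy_rules_duration_seconds_count[10m])
```

Comparing the two averages is the sort of information that could help an administrator pick a suitable `--iptables-min-sync-period` value.
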
+ As discussed above under [Subtle Synchronization
+ Delays](#subtle-synchronization-delays), it would theoretically be
+ possible to have kube-proxy check the results of the partial restores
+ against an equivalent full restore, to ensure that it got the result
+ it was expecting. However, as also discussed there, this would be
+ quite difficult to implement, and would involve much more new code
+ than the actual partial-restores feature. Given that, and the fact
+ that e2e testing during Alpha has not turned up any problems with the
+ feature, it seems that if we tried to sanity check the rules in that
+ way, a failure would be more likely to indicate a bug in the
+ rule-sanity-checking code than a bug in the rule-generation code.
+ Thus, we are not implementing such a metric.

### Dependencies

@@ -437,15 +462,17 @@ take by operations related to kube-proxy SLIs/SLOs.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

- There should be no effect on disk or IO.
+ There should be no effect on disk, IO, or RAM.

- There should be a major _decrease_ in CPU usage by kube-proxy's
- process group (not by kube-proxy itself, but by the `iptables-restore`
- processes it spawns.)
+ The feature itself should result in a _decrease_ in the CPU usage of
+ kube-proxy's process group, since the `iptables-restore` processes it
+ spawns will have less work to do.

- The current PR would not result in any non-negligible increase in
- kube-proxy RAM usage, although some variations of it might involve
- keeping track of more information in memory between resyncs.
+ (If administrators also follow the recommendation to lower
+ `--iptables-min-sync-period`, then that will _increase_ kube-proxy CPU
+ usage, since kube-proxy will be waking up to resync more often. But
+ this is a CPU usage increase that the administrator is more-or-less
+ explicitly requesting.)
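
To make the tradeoff concrete, here is a sketch of the relevant kube-proxy configuration fields; the specific values are examples, not recommendations from the KEP.

```yaml
# Illustrative kube-proxy configuration fragment showing the timing
# knobs discussed above (assumes the MinimizeIPTablesRestore gate is
# enabled, as shown earlier).
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
iptables:
  # Lower bound on how soon a resync can follow the previous one.
  # Smaller values mean lower update latency but more frequent
  # syncProxyRules() runs, and therefore higher kube-proxy CPU usage.
  minSyncPeriod: 1s
  # Maximum interval between resyncs.
  syncPeriod: 30s
```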

### Troubleshooting

@@ -455,28 +482,19 @@ It does not change kube-proxy's behavior in this case.

###### What are other known failure modes?

- (No _known_ failure modes yet. See [Risks and
+ No known failure modes. See [Risks and
Mitigations](#risks-and-mitigations) for theorized potential failure
- modes.)
-
- <!--
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
-   how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
-   running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
-   levels that could help debug the issue?
-   Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
- -->
+ modes.

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

- 2022-08-02: Initial proposal
+ - 2022-10-04: Initial KEP merged
+ - 2022-11-01: Code PR merged
+ - 2022-11-10: Testing PRs merged; 5000Nodes test begins enabling the feature
+ - 2022-12-08: Kubernetes 1.26 released

## Drawbacks
