
Commit fff1bdb

Merge pull request kubernetes#3577 from danwinship/kep-3453-minimize-iptables-beta
KEP-3453 minimize iptables-restore to Beta
2 parents 475b18a + c88cd7a commit fff1bdb

File tree: 4 files changed, +95 -73 lines changed


keps/prod-readiness/sig-network/3453.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 3453
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"
[64.2 KB binary image added]

keps/sig-network/3453-minimize-iptables-restore/README.md

Lines changed: 88 additions & 70 deletions
@@ -181,17 +181,6 @@ any reported failures would be as likely to be the result of a bug in
 the checking code as they would be to indicate a bug in the actual
 syncing code.

-```
-<<[UNRESOLVED]>>
-
-We need to decide whether we are going to try to implement such a
-check and associated metric. (For the initial Alpha release, we will
-not; if we decide to add it later that may postpone the transition to
-Beta.)
-
-<<[/UNRESOLVED]>>
-```
-
 ## Design Details

 ### Test Plan
@@ -244,18 +233,10 @@ appropriate code for that.

 #### Beta

-- If we decide that we need to implement the "partial sync gives
-  different result than full sync" metric, then we will not go to Beta
-  until at least one release has passed with that metric available in
-  Alpha.
-
 - No new bugs
 - Feature actually improves performance (eg, in the `gce-5000Nodes`
   periodic job).
-- Real-world users trying out the Alpha feature have reported both
-  (a) no new bugs, and (b) improved performance.
-- General e2e tests modified to check the "partial sync failures"
-  metric at some point or points during the test run, TBD.
+- Additional metrics to distinguish partial and full resync times

 #### GA

@@ -280,7 +261,7 @@ ways.

 ###### How can this feature be enabled / disabled in a live cluster?

-- [ ] Feature gate (also fill in values in `kep.yaml`)
+- [X] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: `MinimizeIPTablesRestore`
   - Components depending on the feature gate:
     - kube-proxy
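
For reference, the gate flipped on in the hunk above is enabled in kube-proxy either with `--feature-gates=MinimizeIPTablesRestore=true` or through the `featureGates` field of its configuration file. A minimal sketch of the latter, assuming the standard kube-proxy component config; only the gate name comes from this commit, the surrounding values are illustrative:

```yaml
# Sketch: enabling the MinimizeIPTablesRestore gate via the kube-proxy config file.
# Only the feature-gate name comes from this KEP; other fields/values are illustrative.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
featureGates:
  MinimizeIPTablesRestore: true
```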
@@ -308,19 +289,23 @@ Nothing different than enabling it the first time.

 No, but:

-- even with the new code enabled, kube-proxy will always do a full
+- Even with the new code enabled, kube-proxy will always do a full
   sync the first time it syncs after startup (regardless of the
   pre-existing iptables state), so a test that switched from
   "disabled" to "enabled" would not really test anything different
   than a test that just brought up the cluster with the feature
   enabled in the first place.

-- even with the new code enabled, kube-proxy will periodically do a
+- Even with the new code enabled, kube-proxy will periodically do a
   full resync, so a test that switched from "enabled" to "disabled"
   would not really test anything that wouldn't also be tested by just
   letting kube-proxy run for a while with the feature enabled. (Where
   "a while" is much less than the length of an e2e run.)

+- The new code does not make any API writes (or change any other
+  persistent cluster or node state besides the iptables rules), so
+  there is nothing else that would need to be validated.
+
 ### Rollout, Upgrade and Rollback Planning

 ###### How can a rollout or rollback fail? Can it impact already running workloads?
@@ -344,17 +329,19 @@ rules.

 ###### What specific metrics should inform a rollback?

-We should add metrics to catch the failure modes noted above in [Risks
-and Mitigations](#risks-and-mitigations); in particular we should make
-it noticeable when failed partial resyncs occur.
+The new kube-proxy
+`sync_proxy_rules_iptables_partial_restore_failures_total` metric
+indicates when kube-proxy is falling back from a partial resync to a
+full resync due to an unexpected `iptables-restore` error. While it may
+eventually be non-0 due to unrelated unexpected `iptables-restore`
+errors, it should not ever be steadily increasing, and if it was, that
+would likely indicate a bug in the code.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-<!--
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
--->
+No, but as described above, the ordinary operation of kube-proxy with
+the feature enabled includes behavior that is equivalent to what would
+happen during an upgrade or rollback.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

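As a sketch of how an operator might act on the rollback signal described in the hunk above, a Prometheus alerting rule could fire when the partial-restore failure counter keeps climbing. Only the metric name comes from the KEP; the rule name, rate window, and threshold are illustrative assumptions:

```yaml
# Sketch: alert if kube-proxy keeps falling back from partial to full resyncs.
# The 30m rate window and 1h "for" duration are assumptions, not KEP guidance.
groups:
  - name: kube-proxy-minimize-iptables-restore
    rules:
      - alert: KubeProxyPartialRestoreFailuresIncreasing
        expr: rate(sync_proxy_rules_iptables_partial_restore_failures_total[30m]) > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "kube-proxy partial iptables restores are repeatedly falling back to full resyncs"
```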
@@ -376,33 +363,71 @@ is enabled, then it is in use.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-TBD
-
 Sync performance should be much better in large clusters, and should
-be somewhat independent of cluster size. We will have better answers
-after it sees more real-world(-ish) use.
-
-`sync_proxy_rules_duration_seconds`
-`network_programming_duration_seconds`
-`sync_proxy_rules_endpoint_changes_pending`
-`sync_proxy_rules_service_changes_pending`
-`sync_proxy_rules_last_queued_timestamp_seconds` vs `sync_proxy_rules_last_timestamp_seconds`
-
+be somewhat independent of cluster size.
+
+![graph of performance over time](5000Node-perf.png)
+
+Enabling this feature on the 5000-node scalability test immediately
+resulted in a 2x improvement to the
+`network_programming_duration_seconds` (aka
+`NetworkProgrammingLatency`) metric in the 50th, 90th, and 99th
+percentile buckets, with the 50th percentile going from about 27
+seconds to about 13 (the first drop in the graph above).
+
+On further investigation, we realized that it could theoretically have
+improved further, but the tests were running kube-proxy with
+`--iptables-min-sync-period 10s`, which introduced its own latency.
+Dropping them to `--iptables-min-sync-period 1s` (the default) caused
+a further drop in the 50th percentile latency, to about 5 seconds (the
+second drop in the graph above) though at a cost of increased CPU
+usage (since `syncProxyRules()`, the heart of kube-proxy, runs 10
+times as often now). (90th and 99th percentile latency did not change,
+because they are more dominated by the time it takes to do _full_
+resyncs. We realized that we should probably add separate partial/full
+sync time metrics to get more fine-grained information.)
+
+For 1.26 we added a new section to the kube-proxy documentation
+describing the kube-proxy `--sync-period` and
+`--iptables-min-sync-period` options ([website #28435]), which also
+mentions that the `MinimizeIPTablesRestore` feature reduces the need
+for `--iptables-min-sync-period` and that administrators using that
+feature should try out smaller min-sync-period values. For 1.27, we
+should add a note about the latency vs CPU usage tradeoff, and we
+should make sure to have a release note about this as well.
+
+[website #28435]: https://github.com/kubernetes/website/pull/38435

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - [X] Metrics
-  - Metric name: sync_proxy_rules_iptables_partial_restore_failures_total
-  - [Optional] Aggregation method: ??? (I don't know what this means)
+  - Metric names:
+    - sync_proxy_rules_iptables_partial_restore_failures_total
+    - sync_full_proxy_rules_duration_seconds
+    - sync_partial_proxy_rules_duration_seconds
   - Components exposing the metric:
     - kube-proxy

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+As of 1.26, there are no metrics for distinguishing the average
+partial sync time from the average full sync time, which would help
+administrators to understand kube-proxy's performance better and
+figure out a good `--iptables-min-sync-period` value. We will add
+those metrics for beta.
+
+As discussed above under [Subtle Synchronization
+Delays](#subtle-synchronization-delays), it would theoretically be
+possible to have kube-proxy check the results of the partial restores
+against an equivalent full restore, to ensure that it got the result
+it was expecting. However, as also discussed there, this would be
+quite difficult to implement, and would involve much more new code
+than the actual partial-restores feature. Given that, and the fact
+that e2e testing during Alpha has not turned up any problems with the
+feature, it seems that if we tried to sanity check the rules in that
+way, a failure would be more likely to indicate a bug in the
+rule-sanity-checking code than a bug in the rule-generation code.
+Thus, we are not implementing such a metric.

 ### Dependencies

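The SLO discussion above recommends that administrators using the feature try smaller `--iptables-min-sync-period` values; in configuration-file form the same knobs live under the `iptables` section of the kube-proxy component config. A sketch with illustrative values (the 1s figure is the default mentioned above; nothing here is prescribed by this commit):

```yaml
# Sketch: the sync-period knobs discussed above, set via the kube-proxy config
# file instead of the --sync-period / --iptables-min-sync-period flags.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  minSyncPeriod: 1s   # lower bound between resyncs; 1s is the default noted above
  syncPeriod: 30s     # upper bound on how long kube-proxy goes between resyncs (illustrative value)
```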
@@ -437,15 +462,17 @@ take by operations related to kube-proxy SLIs/SLOs.

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

-There should be no effect on disk or IO.
+There should be no effect on disk, IO, or RAM.

-There should be a major _decrease_ in CPU usage by kube-proxy's
-process group (not by kube-proxy itself, but by the `iptables-restore`
-processes it spawns.)
+The feature itself should result in a _decrease_ in the CPU usage of
+kube-proxy's process group, since the `iptables-restore` processes it
+spawns will have less work to do.

-The current PR would not result in any non-negligible increase in
-kube-proxy RAM usage, although some variations of it might involve
-keeping track of more information in memory between resyncs.
+(If administrators also follow the recommendation to lower
+`--iptables-min-sync-period`, then that will _increase_ kube-proxy CPU
+usage, since kube-proxy will be waking up to resync more often. But
+this is a CPU usage increase that the administrator is more-or-less
+explicitly requesting.)

 ### Troubleshooting

@@ -455,28 +482,19 @@ It does not change kube-proxy's behavior in this case.

 ###### What are other known failure modes?

-(No _known_ failure modes yet. See [Risks and
+No known failure modes. See [Risks and
 Mitigations](#risks-and-mitigations) for theorized potential failure
-modes.)
-
-<!--
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-  - Detection: How can it be detected via metrics? Stated another way:
-    how can an operator troubleshoot without logging into a master or worker node?
-  - Mitigations: What can be done to stop the bleeding, especially for already
-    running user workloads?
-  - Diagnostics: What are the useful log messages and their required logging
-    levels that could help debug the issue?
-    Not required until feature graduated to beta.
-  - Testing: Are there any tests for failure mode? If not, describe why.
--->
+modes.

 ###### What steps should be taken if SLOs are not being met to determine the problem?

 ## Implementation History

 - 2022-08-02: Initial proposal
+- 2022-10-04: Initial KEP merged
+- 2022-11-01: Code PR merged
+- 2022-11-10: Testing PRs merged; 5000Nodes test begins enabling the feature
+- 2022-12-08: Kubernetes 1.26 released

 ## Drawbacks

keps/sig-network/3453-minimize-iptables-restore/kep.yaml

Lines changed: 5 additions & 3 deletions
@@ -12,17 +12,17 @@ approvers:
   - "@thockin"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.26"
+latest-milestone: "v1.27"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.26"
-  beta:
+  beta: "v1.27"
   stable:

 # The following PRR answers are required at alpha release
@@ -36,3 +36,5 @@ disable-supported: true
 # The following PRR answers are required at beta release
 metrics:
   - sync_proxy_rules_iptables_partial_restore_failures_total
+  - sync_full_proxy_rules_duration_seconds
+  - sync_partial_proxy_rules_duration_seconds
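
The two duration metrics added above are meant to let operators compare partial and full resync latency. Assuming they end up as Prometheus histograms like the existing `sync_proxy_rules_duration_seconds`, recording rules for the median of each might look like the following sketch (rule names and the 5m window are illustrative):

```yaml
# Sketch: median partial vs. full resync latency, assuming the new metrics are
# histograms (hence the _bucket suffix) like sync_proxy_rules_duration_seconds.
groups:
  - name: kube-proxy-sync-latency
    rules:
      - record: kubeproxy:sync_partial_proxy_rules_duration_seconds:p50
        expr: histogram_quantile(0.5, sum(rate(sync_partial_proxy_rules_duration_seconds_bucket[5m])) by (le))
      - record: kubeproxy:sync_full_proxy_rules_duration_seconds:p50
        expr: histogram_quantile(0.5, sum(rate(sync_full_proxy_rules_duration_seconds_bucket[5m])) by (le))
```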

0 commit comments