@@ -130,6 +130,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
+ - [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Upgrade strategy](#upgrade-strategy)
- [Downgrade strategy](#downgrade-strategy)
@@ -1102,7 +1103,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
https://storage.googleapis.com/k8s-triage/index.html
-->

- - <test>: <link to test coverage> FIXME to fill
+ No integration tests are planned. We'll cover this with e2e_node tests.

##### e2e tests
@@ -1116,7 +1117,8 @@ https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

- - <test>: <link to test coverage> FIXME to fill
+ - Test failures: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=SidecarContainers
+ - All related tests can be filtered with the SidecarContainers keyword.

##### Ready state of a sidecar container is properly used to create/delete endpoints
@@ -1128,16 +1130,20 @@ The sidecar container can expose ports and can be used to handle external traffi

##### Pod lifecycle scenarios without sidecar containers

- TBD: describe test cases that can be affected by introducing sidecar containers.
+ - Init containers should start after the previous init container has completed
+ - Regular containers should start in parallel after all init containers have completed
+ - Restart behavior of the init containers

##### Pod lifecycle scenarios with sidecar containers

- TBD: describe test cases to test with sidecar containers.
+ - Init containers should start after the previous restartable init container has started
+ - Regular containers should start in parallel after all regular init containers have completed and all restartable init containers have started
+ - Restartable init containers should always restart
##### Kubelet restart test cases

- TBD: describe test cases on how Pod with sidecar containers behaves when kubelet
- was restarted
+ - It should restart the containers in the right order after the node reboot
+ - It should not restart any completed init containers after the kubelet restart

##### API server is down: failure to update containers status during initialization
@@ -1254,13 +1260,15 @@ in back-to-back releases.
#### Beta

- Implement proper termination ordering.
- - Provide defaults for `restartPolicy` field on init containers, `nil` is not
- an ideal default long term.
- - Allow to apply security policies on all containers in `initContainers`
- collection. Example may be disabling `kubectl exec` on containers in
- `initContainers` collection.
+ - Allow sidecar containers to restart during the shutdown of the Pod.
+ - Add tests with feature activation and deactivation (see [Feature Enablement and Rollback](#feature-enablement-and-rollback)).

+ #### GA

+ - Enable `livenessProbe` during the shutdown of the Pod for sidecar containers that restart.
+ - Allow to apply security policies on all containers in `initContainers`
+ collection. Example may be disabling `kubectl exec` on containers in
+ `initContainers` collection.

### Upgrade / Downgrade Strategy
@@ -1298,13 +1306,19 @@ Pods that has already been created will stay being scheduled after the downgrade
not be rejected by control
plane nor by kubelet. Both will treat the sidecar container as an Init container.
This may render the Pod unusable as it will stuck in initialization forever -
- sidecar container are never exiting. We will document this behavior for Alpha release.
- Promoting feature to Beta we will revisit the situation. If we will see this as
- a major issue, we will need to wait for 3 releases so kubelet will have the logic
+ sidecar containers never exit.
+ This behavior has been documented for the Alpha release, but we do not see it as a
+ major issue requiring us to wait for 3 releases so the kubelet will have the logic
to reject such Pods when the feature gate is disabled to keep Downgrade safe.

- **Note**, For the control plane and kubelet we will implement logic to reject Pods
- with sidecar containers when feature gate got turned off.
+ **Note**, We have implemented logic for the
+ [kubelet](https://github.com/kubernetes/kubernetes/blob/f19b62fc0914b38941922afefd1e34eb55f87ee7/pkg/kubelet/lifecycle/predicate.go#L78-L91)
+ to reject Pods with sidecar containers when the feature gate is turned off.
+ For the control plane -
+ [kube-apiserver](https://github.com/kubernetes/kubernetes/blob/f19b62fc0914b38941922afefd1e34eb55f87ee7/pkg/api/pod/util.go#L554-L560)
+ drops the field (if it wasn't set before) and
+ [kube-scheduler](https://github.com/kubernetes/kubernetes/blob/f19b62fc0914b38941922afefd1e34eb55f87ee7/pkg/scheduler/framework/plugins/noderesources/fit.go#L256-L262)
+ keeps pods with the field set unschedulable.

See [Upgrade/downgrade testing](#upgradedowngrade-testing) section.
Workloads will have to be deleted and recreated with the old way of handling
@@ -1452,7 +1466,7 @@ You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

- See [Upgrade/downgrade testing](#upgradedowngrade-testing).
+ Not yet, but it is planned/required to add them before graduation to Beta.

### Rollout, Upgrade and Rollback Planning
@@ -1472,13 +1486,45 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

+ Rollout could fail for multiple reasons:
+
+ - webhooks that are not recompiled with the new field will strip it out
+ - a bug in the resource calculation or CPU reservation logic could render the Pod unschedulable
+ - a bug in the kubelet affecting the pod lifecycle could cause the Pod to be stuck in initialization
+
+ However, we have tried to maintain a high coverage of unit tests to ensure we catch these.
+
+ Rollback can fail if a Pod with sidecars is scheduled on a node where the feature
+ is disabled.
+ In that case the Pod will be rejected by the kubelet and will be stuck in the Pending state.
+ Therefore, we advise first disabling the feature gate on the control plane and then
+ proceeding with the nodes.
+
+ Running workloads are not impacted.
+
+ Pods with sidecars might take a long time to exit and exceed the TGPS; a new
+ event should be added in Beta to help administrators diagnose this issue.
+ Rather than rolling back the feature, they should work on the graceful termination
+ of their main containers to ensure sidecars have enough time to be notified
+ and exit on their own.
+
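+ The disable ordering described above can be sketched as follows; this is an illustrative fragment, not text from the KEP (the gate is first turned off on the kube-apiserver via its command-line flag, then on each kubelet through its configuration file):

```yaml
# Sketch: disable the gate on the control plane first, e.g.
#   kube-apiserver --feature-gates=SidecarContainers=false
# then drain and update each node's kubelet configuration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  SidecarContainers: false
```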
###### What specific metrics should inform a rollback?
<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

+ - [X] Metrics
+   - Metric name: kubelet_started_containers_errors_total
+     - Type: Counter
+     - Labels: code, container_type (should be `init_container`)
+     - Components exposing the metric: `kubelet-metrics`
+     - Symptoms: high number of errors indicates that the kubelet is unable to start the sidecar containers
+ - [X] Events
+   - Event name: TBD
+   - Symptoms: high number of events indicates that the TGPS has been exceeded and sidecars have been terminated ungracefully
+

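+ The error-counter signal above could be watched with a query like the following sketch, assuming the kubelet `/metrics` endpoints are scraped by Prometheus (aggregation labels depend on the scrape configuration):

```promql
# Failed init/sidecar container starts over the last 5 minutes;
# a sustained non-zero rate after enabling the gate is a rollback signal.
sum(rate(kubelet_started_containers_errors_total{container_type="init_container"}[5m]))
```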
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
<!--
@@ -1487,12 +1533,41 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

+ Upgrade->downgrade->upgrade testing was done manually using the following steps:
+
+ Kubelet specific:
+ 1. Deploy k8s 1.29-alpha
+ 2. Enable the `SidecarContainers` feature gate on the control plane and kubelet
+ 3. Deploy a Pod with sidecar containers using a Deployment
+ 4. Disable the `SidecarContainers` feature gate on the kubelet (requires a restart)
+ 5. Drain the node
+ 6. Pod is rejected by kubelet
+ 7. Enable the `SidecarContainers` feature gate on the kubelet (requires a restart)
+ 8. Pod is scheduled and works as expected
+
+ Control plane specific:
+ 1. Deploy k8s 1.29-alpha
+ 2. Enable the `SidecarContainers` feature gate on the control plane and kubelet
+ 3. Deploy a Pod with sidecar containers using a Deployment
+ 4. Disable the `SidecarContainers` feature gate on the control plane
+ 5. Delete the Pod
+ 6. Pod is created without the new field - init containers are not recognized as sidecars and block the Pod in initialization
+ 7. Modify the Deployment by moving the sidecar containers to the regular containers section
+ 8. Pod is scheduled and works (without the sidecar support)
+ 9. Enable the `SidecarContainers` feature gate on the control plane
+ 10. Delete the Pod
+ 11. Pod is scheduled and works (without the sidecar support)
+ 12. Modify the Deployment by moving the sidecar containers to the init containers section
+ 13. Pod is scheduled and works (with the sidecar support)
+
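+ A Deployment of the shape used in steps 3, 7, and 12 might look like the following sketch (the names, images, and commands are illustrative, not taken from the KEP):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidecar-demo        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sidecar-demo
  template:
    metadata:
      labels:
        app: sidecar-demo
    spec:
      initContainers:
      - name: logshipper             # restartable init container = sidecar
        image: alpine:3.18
        command: ["sh", "-c", "tail -F /dev/null"]
        restartPolicy: Always        # the new field guarded by SidecarContainers
      containers:
      - name: main
        image: alpine:3.18
        command: ["sh", "-c", "sleep infinity"]
```

Step 7 corresponds to moving the `logshipper` entry (without `restartPolicy`) under `containers`; step 12 moves it back.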
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
<!--
Even if applying deprecation policies, they may still surprise some users.
-->

+ No.
+
### Monitoring Requirements
<!--
@@ -1510,6 +1585,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

+ By checking if `.spec.initContainers[i].restartPolicy` is set to `OnFailure` or `Always`.
+
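+ One possible spot-check for the field (the jsonpath expression is illustrative; output depends on the workloads in the cluster):

```
$ kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.initContainers[?(@.restartPolicy=="Always")]}{.name}{"\n"}{end}{end}'
```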
###### How can someone using this feature know that it is working for their instance?
<!--
@@ -1521,13 +1598,17 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
- - [ ] Events
-   - Event Reason:
- - [ ] API .status
-   - Condition name:
-   - Other field:
- - [ ] Other (treat as last resort)
-   - Details:
+ End users can check components that are using the new feature, such as Istio, if istio-proxy runs as a sidecar container:
+
+ ```
+ $ kubectl get pod -o "custom-columns="\
+ "NAME:.metadata.name,"\
+ "INIT:.spec.initContainers[*].name,"\
+ "CONTAINERS:.spec.containers[*].name"
+
+ NAME                     INIT                     CONTAINERS
+ sleep-7656cf8794-8fhdk   istio-init,istio-proxy   sleep
+ ```
+

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -1546,18 +1627,33 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

+ - number of running containers should not change by more than 10% throughout the day,
+ as measured by the number of running containers at the beginning and end of the day
+ - error rate for containers of type init_container should be less than 1%,
+ as measured by the number of errors divided by the total number of init_container containers
+ - number of events indicating that TGPS has been exceeded should be less than 10 per day,
+ as measured by the number of events logged in the kubelet log
+ - 99% of jobs with sidecars should complete successfully,
+ as measured by the number of jobs that complete successfully divided by the total number of jobs with sidecars
+
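+ The last SLO assumes a Job can complete while its sidecar is still running; a minimal sketch of such a Job (names, images, and commands are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-sidecar    # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: proxy                   # sidecar: keeps running while the worker
        image: alpine:3.18            # runs, terminated after it exits
        command: ["sh", "-c", "tail -F /dev/null"]
        restartPolicy: Always
      containers:
      - name: worker
        image: alpine:3.18
        command: ["sh", "-c", "echo done"]
```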
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
<!--
Pick one more of these and delete the rest.
-->

- - [ ] Metrics
-   - Metric name:
-   - [Optional] Aggregation method:
-   - Components exposing the metric:
- - [ ] Other (treat as last resort)
-   - Details:
+ - [X] Metrics
+   - Metric name: kubelet_running_containers
+     - Type: Gauge
+     - Labels: container_state
+     - Components exposing the metric: `kubelet-metrics`
+   - Metric name: kubelet_started_containers_errors_total
+     - Type: Counter
+     - Labels: code, container_type (should be `init_container`)
+     - Components exposing the metric: `kubelet-metrics`
+ - [X] Events
+   - Event name: TBD
+   - should not appear, unless TGPS is exceeded and sidecars are terminated

###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -1566,6 +1662,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

+ No.
+
### Dependencies
<!--
@@ -1589,6 +1687,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

+ No.
+
### Scalability
<!--
@@ -1616,6 +1716,8 @@ Focusing mostly on:
heartbeats, leader election, etc.)
-->

+ No.
+
###### Will enabling / using this feature result in introducing new API types?
<!--
@@ -1625,6 +1727,8 @@ Describe them, providing:
- Supported number of objects per namespace (for namespace-scoped objects)
-->

+ No.
+
###### Will enabling / using this feature result in any new calls to the cloud provider?
<!--
@@ -1633,6 +1737,8 @@ Describe them, providing:
- Estimated increase:
-->

+ No.
+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
<!--
@@ -1642,6 +1748,8 @@ Describe them, providing:
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->

+ No.
+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
<!--
@@ -1653,6 +1761,10 @@ Think about adding additional work or introducing new steps in between
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

+ Graceful Pod termination might take longer with sidecars since their exit sequence starts after the
+ last main container has stopped.
+ The impact should be negligible because the TGPS is enforced in all cases.
+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
<!--
@@ -1665,6 +1777,8 @@ This through this both in small and large cases, again with respect to the
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

+ No.
+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
<!--
@@ -1677,6 +1791,10 @@ Are there any tests that were run/should be run to understand performance charac
and validate the declared limits?
-->

+ No, since the KEP only enables a new way to run containers as sidecars instead of regular containers.
+ Resource consumption can even be lower since various tricks using emptyDir volumes to perform synchronization
+ (as with istio-proxy) are no longer needed.
+
### Troubleshooting
<!--
@@ -1692,6 +1810,8 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?

+ Nothing changes compared to the current kubelet behavior.
+
###### What are other known failure modes?
<!--
@@ -1707,8 +1827,27 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

+ - Main containers don't exit within TGPS, leading to sidecars being terminated
+   - Detection: high number of events indicating TGPS has been exceeded
+   - Mitigations: ensure timely termination of main containers
+   - Diagnostics: Events
+   - Testing: TBD
+ - Main container or sidecar uses a preStop hook consuming TGPS, leading to remaining sidecars being terminated
+   - Detection: high number of events indicating TGPS has been exceeded
+   - Mitigations: ensure preStop hooks are not delaying termination
+   - Diagnostics: Events
+   - Testing: TBD
+ - Sidecar container uses a preStop hook that makes the container exit during Pod shutdown, sidecar is restarted, leading
+ to a CrashLoopBackOff
+   - Detection: sidecar in CrashLoopBackOff during termination
+   - Mitigations: ensure preStop hooks are not making the container exit, document best practices
+   - Diagnostics: Events
+   - Testing: TBD
+
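+ For the last failure mode, the point is that a sidecar's preStop hook should only delay termination, never stop the sidecar's own process, since a restartable init container that exits during shutdown is restarted. A hedged sketch (image and names are illustrative):

```yaml
# Safer pattern: the hook only waits, and the container itself handles SIGTERM.
# Anti-pattern (not shown): a hook that kills the sidecar's process, which makes
# the container exit, get restarted (restartPolicy: Always), and potentially
# enter CrashLoopBackOff during Pod shutdown.
initContainers:
- name: proxy
  image: envoyproxy/envoy:v1.27.0      # illustrative image
  restartPolicy: Always
  lifecycle:
    preStop:
      exec:
        command: ["sh", "-c", "sleep 5"]   # wait; do not kill the process
```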
###### What steps should be taken if SLOs are not being met to determine the problem?

+ None.
+
## Implementation History
<!--