
Commit 78574a3

Update PRR for Beta

Signed-off-by: Deep Debroy <[email protected]>
1 parent d052b7b

File tree:
  • keps/sig-node/3085-pod-conditions-for-starting-completition-of-sandbox-creation

1 file changed: +54 −12 lines

keps/sig-node/3085-pod-conditions-for-starting-completition-of-sandbox-creation/README.md

Lines changed: 54 additions & 12 deletions
@@ -1209,8 +1209,6 @@ used to confirm that the new pod condition introduced is being:
 This section must be completed when targeting beta to a release.
 -->
 
-Skipping this section at the Alpha stage and will populate at Beta.
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
 <!--
@@ -1223,13 +1221,39 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
 
+This flag is only relevant for the Kubelet. Therefore, the new condition will be
+reported for pods scheduled on nodes that have the feature enabled.
+
+A controller or service that consumes the new pod condition should be enabled
+only after the rollout of the new condition has succeeded on all nodes. Similarly,
+a controller or service that consumes the new pod condition should be disabled
+before a rollback. This prevents a consuming controller/service from seeing the
+condition on only a subset of the nodes in the middle of a rollout or rollback.
+
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
 
+A sharp increase in the number of PATCH requests to the API server from Kubelets
+after enabling this feature is a sign of a potential problem and can inform a
+rollback. A cluster operator may monitor
+```
+apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}
+```
+for this.
+
+This may be the case in clusters that use a special runtime environment like
+microVM/Kata, where the sandbox may crash repeatedly (without ever getting a
+chance to start containers), resulting in many status updates due to the new
+condition "flapping". However, in such environments, this may already be the
+case with existing pod conditions like ContainersReady and Ready (unless the
+sandbox environment/VM crashes very early, before a single container is run).
+Batching of pod status updates in the Kubelet status manager will also help.
+
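As a sketch of how an operator might act on the rollback signal above, the query can be wrapped in a Prometheus alerting rule. This is illustrative only: the rule name, the 2x threshold, and the day-over-day comparison window are assumptions, not values from this KEP.

```
# Hypothetical alerting rule (names and thresholds are illustrative):
# fire when the rate of pod-status PATCH requests doubles versus the
# same time yesterday, sustained for 10 minutes.
groups:
  - name: pod-status-patch-rate
    rules:
      - alert: PodStatusPatchRateHigh
        expr: |
          sum(rate(apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}[5m]))
            > 2 * sum(rate(apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}[5m] offset 1d))
        for: 10m
        annotations:
          summary: Pod status PATCH request rate doubled versus yesterday.
```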
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
 <!--
@@ -1238,20 +1262,26 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
+Upgrade/downgrade of a Kubelet incorporating this feature (preceded by draining
+of pods) has been tested successfully.
+
+New pods scheduled on the node after un-cordoning, following a node
+upgrade/downgrade, surface the expected pod conditions.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
 
+No
+
 ### Monitoring Requirements
 
 <!--
 This section must be completed when targeting beta to a release.
 -->
 
-Skipping this section at the Alpha stage and will populate at Beta.
-
 ###### How can an operator determine if the feature is in use by workloads?
 
 <!--
@@ -1260,6 +1290,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+This question isn't entirely relevant for this feature, since it is an
+administrator-enabled feature controlled via a kubelet flag, not something the
+user controls with an API server resource spec.
+
+Checking the pod conditions on nodes with this feature enabled is the simplest
+way to verify that the feature is enabled properly on a vanilla Kubernetes cluster.
+
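The check described above can be scripted. A minimal sketch: against a live cluster one could query the condition with `kubectl` (the pod name `mypod` is illustrative); here the same filtering is demonstrated on a sample status document so the snippet is self-contained.

```shell
# Against a live cluster (pod name illustrative):
#   kubectl get pod mypod -o jsonpath='{.status.conditions[?(@.type=="PodReadyToStartContainers")].status}'
# Self-contained demonstration on a sample pod status:
cat > /tmp/pod-status.json <<'EOF'
{"conditions": [
  {"type": "PodReadyToStartContainers", "status": "True"},
  {"type": "Ready", "status": "True"}
]}
EOF
# Extract the new condition and its status from the sample document.
grep -o '"type": "PodReadyToStartContainers", "status": "[^"]*"' /tmp/pod-status.json
```

On a node with the feature enabled, the condition should appear with `status: "True"` once the pod sandbox has been created.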
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -1273,8 +1310,8 @@ Recall that end users cannot usually observe component logs or access metrics.
 
 - [ ] Events
   - Event Reason:
-- [ ] API .status
-  - Condition name:
+- [x] API .status
+  - Condition name: PodReadyToStartContainers reported for the pod
   - Other field:
 - [ ] Other (treat as last resort)
   - Details:
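For illustration, the new condition as it might appear in a pod's `.status.conditions`; the timestamps are placeholder values, not from this KEP.

```
# Illustrative pod .status excerpt (timestamps are placeholders):
status:
  conditions:
    - type: PodReadyToStartContainers
      status: "True"
      lastProbeTime: null
      lastTransitionTime: "2023-01-01T00:00:00Z"
    - type: Ready
      status: "True"
      lastProbeTime: null
      lastTransitionTime: "2023-01-01T00:00:05Z"
```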
@@ -1302,12 +1339,8 @@ question.
 Pick one more of these and delete the rest.
 -->
 
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Other (treat as last resort)
+  - Details: There are no specific SLIs for the Kubelet status manager.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
@@ -1316,6 +1349,15 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
 
+New metrics may be added to the Kubelet status manager to surface fine-grained
+information about updates to the overall pod status as well as to specific pod
+conditions. However, such a change affects the whole Kubelet status manager
+(rather than specific pod conditions) and is thus beyond the scope of this KEP.
+
+A general Kubernetes metrics collector like Kube State Metrics (which already
+consumes pod conditions and surfaces them as metrics) will need to be enhanced
+to consume the new pod condition in this KEP.
+
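If Kube State Metrics were extended as described, the new condition could be queried like its existing per-condition metrics. A hypothetical query; the metric name `kube_pod_status_ready_to_start_containers` does not exist today and is purely illustrative of what such an enhancement might expose.

```
# Hypothetical PromQL (metric name illustrative, assuming the
# kube-state-metrics enhancement described above): fraction of pods
# per node reporting the new condition as true.
sum by (node) (kube_pod_status_ready_to_start_containers{condition="true"})
  / sum by (node) (kube_pod_info)
```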
 ### Dependencies
 
 <!--
