@@ -1209,8 +1209,6 @@ used to confirm that the new pod condition introduced is being:
This section must be completed when targeting beta to a release.
-->

- Skipping this section at the Alpha stage and will populate at Beta.
-

###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
@@ -1223,13 +1221,39 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

+ This flag is only relevant for the Kubelet. Therefore, the new condition will be
+ reported only for pods scheduled on nodes that have the feature enabled.
+
+ A controller or service that consumes the new pod condition should be enabled
+ only after the rollout of the new condition has succeeded on all nodes.
+ Similarly, a controller or service that consumes the new pod condition should be
+ disabled before a rollback. This ordering prevents a consuming
+ controller/service from receiving the condition for only the subset of nodes
+ that has the feature enabled in the middle of a rollout or rollback.
+
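The ordering above amounts to a simple gating check: enable the consumer only once every node's pods report the new condition. The sketch below is illustrative only, with made-up data shapes; `pods_by_node` stands in for pod status listings an operator or controller would fetch from the API server.

```python
# Illustrative sketch (not actual controller code): gate enabling a consumer
# of the new pod condition on every node reporting it. `pods_by_node` is a
# stand-in for data fetched from the API server.

NEW_CONDITION = "PodReadyToStartContainers"

def node_reports_condition(pod_statuses):
    """True if every pod on the node carries the new condition type."""
    return all(
        any(c["type"] == NEW_CONDITION for c in pod.get("conditions", []))
        for pod in pod_statuses
    )

def safe_to_enable_consumer(pods_by_node):
    """Enable the consuming controller/service only after all nodes report
    the condition, i.e. the rollout has finished everywhere."""
    return all(node_reports_condition(pods) for pods in pods_by_node.values())

# Example: node-b still runs a kubelet without the feature enabled.
pods_by_node = {
    "node-a": [{"conditions": [{"type": NEW_CONDITION, "status": "True"}]}],
    "node-b": [{"conditions": [{"type": "Ready", "status": "True"}]}],
}
print(safe_to_enable_consumer(pods_by_node))  # mid-rollout -> False
```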

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

+ A sharp increase in the number of PATCH requests to the API server from
+ Kubelets after enabling this feature is a sign of a potential problem and can
+ inform a rollback. A cluster operator may monitor
+ ```
+ apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}
+ ```
+ for this.
+
+ This may be the case in clusters that use a special runtime environment like
+ microVM/Kata, where the sandbox may crash repeatedly (without ever getting a
+ chance to start containers), resulting in many status updates due to the new
+ condition "flapping". However, in such environments, this may already be the
+ case with existing pod conditions like ContainersReady and Ready (unless the
+ sandbox environment/VM crashes very early, before a single container is run).
+ Batching of pod status updates in the Kubelet status manager will also help.
+
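The query above returns a cumulative counter, so what informs a rollback decision is its rate of increase relative to a pre-enablement baseline. A minimal sketch of that comparison, using made-up counter samples rather than a live Prometheus query, and an assumed 2x threshold:

```python
# Hedged sketch: decide whether the PATCH-request rate from Kubelets has
# spiked after enabling the feature. The counter samples below are made up;
# in practice they would come from scraping
#   apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}

def per_second_rate(count_start, count_end, window_seconds):
    """Rate of a monotonically increasing counter over a sampling window."""
    return (count_end - count_start) / window_seconds

def spike_detected(baseline_rate, current_rate, factor=2.0):
    """Flag a potential problem when the current rate exceeds the
    pre-enablement baseline by `factor` (the threshold is an assumption)."""
    return current_rate > baseline_rate * factor

baseline = per_second_rate(10_000, 10_300, 300)  # 1 req/s before enabling
current = per_second_rate(20_000, 21_500, 300)   # 5 req/s after enabling
print(spike_detected(baseline, current))  # prints True
```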

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -1238,20 +1262,26 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

+ Upgrade/downgrade of a Kubelet incorporating this feature (preceded by
+ draining the node's pods) has been tested successfully.
+
+ New pods scheduled on the node after it is uncordoned following an
+ upgrade/downgrade surface the expected pod conditions.
+

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

+ No.
+

### Monitoring Requirements

<!--
This section must be completed when targeting beta to a release.
-->

- Skipping this section at the Alpha stage and will populate at Beta.
-

###### How can an operator determine if the feature is in use by workloads?

<!--
@@ -1260,6 +1290,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

+ This question isn't entirely relevant for this feature, since this is an
+ administrator-enabled feature controlled via a Kubelet flag, not something a
+ user controls with an API server resource spec.
+
+ Checking the pod conditions on nodes with this feature enabled is the simplest
+ way to verify that the feature is working properly on a vanilla Kubernetes
+ cluster.
+
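Concretely, that check amounts to looking for the new condition type in a pod's `.status.conditions`. The pod dict below is a made-up stand-in mimicking the API object shape; in practice the data would come from `kubectl get pod -o json` or a client library.

```python
# Sketch: verify the feature is active by inspecting .status.conditions of
# a pod on a node with the feature enabled. `pod` mimics the API object shape.

NEW_CONDITION = "PodReadyToStartContainers"

def get_condition(pod, cond_type):
    """Return the matching condition dict, or None if the kubelet on this
    pod's node does not (yet) report it."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond["type"] == cond_type:
            return cond
    return None

pod = {
    "metadata": {"name": "web-0"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": NEW_CONDITION, "status": "True"},
            {"type": "Ready", "status": "True"},
        ]
    },
}

cond = get_condition(pod, NEW_CONDITION)
print(cond["status"] if cond else "feature not enabled on this node")
```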

###### How can someone using this feature know that it is working for their instance?

<!--
@@ -1273,8 +1310,8 @@ Recall that end users cannot usually observe component logs or access metrics.

- [ ] Events
  - Event Reason:
- - [ ] API .status
-   - Condition name:
+ - [x] API .status
+   - Condition name: PodReadyToStartContainers reported for pod
  - Other field:
- [ ] Other (treat as last resort)
  - Details:
@@ -1302,12 +1339,8 @@ question.
Pick one more of these and delete the rest.
-->

- - [ ] Metrics
-   - Metric name:
-   - [ Optional] Aggregation method:
-   - Components exposing the metric:
- - [ ] Other (treat as last resort)
-   - Details:
+ - [x] Other (treat as last resort)
+   - Details: There are no specific SLIs for the Kubelet Status Manager.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -1316,6 +1349,15 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

+ New metrics may be added to the Kubelet status manager to surface fine-grained
+ information about updates to overall pod status as well as to specific pod
+ conditions. However, such a change affects the whole Kubelet status manager
+ (rather than specific pod conditions) and is thus beyond the scope of this KEP.
+
+ A general Kubernetes metrics collector like Kube State Metrics (which already
+ consumes pod conditions and surfaces them as metrics) will need to be enhanced
+ to consume the new pod condition in this KEP.
+
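To sketch what such an enhancement amounts to: a collector walks each pod's conditions and emits one labeled sample per condition. The metric name and label set below are hypothetical, loosely modeled on Kube State Metrics' per-condition metrics, not its actual output:

```python
# Hypothetical sketch of a collector exporting pod conditions as metric
# samples. The metric/label names are assumptions modeled on Kube State
# Metrics, not its real output.

def condition_samples(pod):
    """Yield (metric_name, labels, value) triples, one per pod condition;
    value is 1.0 when the condition's status is True, else 0.0."""
    name = pod["metadata"]["name"]
    for cond in pod.get("status", {}).get("conditions", []):
        labels = {"pod": name, "condition": cond["type"]}
        yield ("kube_pod_status_condition", labels,
               1.0 if cond["status"] == "True" else 0.0)

pod = {
    "metadata": {"name": "web-0"},
    "status": {"conditions": [
        {"type": "PodReadyToStartContainers", "status": "True"},
        {"type": "Ready", "status": "False"},
    ]},
}

for metric, labels, value in condition_samples(pod):
    print(metric, labels, value)
```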

### Dependencies

<!--