Commit 126a79d

added metrics to PRR
1 parent 2f5c7a9 commit 126a79d

1 file changed: +26 -5 lines changed

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 26 additions & 5 deletions
@@ -278,6 +278,12 @@ Planned tests:
 - Pod failed due to unhealthy device, earlier than device plugin detected it. Pod status is still updated.
 - Pod is in crash loop backoff due to unhealthy device - pod status is updated to unhealthy

+For alpha rollout and rollback:
+
+- Fields dropped on update when feature gate is disabled
+- Field is not populated after the feature gate is disabled
+- Field is populated again when the feature gate is enabled
+
 Test coverage will be listed once tests are implemented.

 - <test>: <link to test coverage>
@@ -330,15 +336,20 @@ No

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

-Yes, with no side effect except of missing the new field in pod status.
+Yes, with no side effect except of missing the new field in pod status. Values written
+while the feature was enabled will continue to have it and may be wiped on next update request.
+They also may be ignored on reads.
+Re-enablement of the feature will not guarantee to keep the values written before the
+feature was disabled.

 ###### What happens if we reenable the feature if it was previously rolled back?

-The pod status will be updated again.
+The pod status will be updated again. Consistency will not be guaranteed for fields written
+before the last enablement.

 ###### Are there any tests for feature enablement/disablement?

-Nothing is planned.
+Yes, see in e2e tests section.

 ### Rollout, Upgrade and Rollback Planning

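The rollback behaviour described in this hunk (the field is wiped on the next update while the gate is off, and is populated again after re-enablement) can be sketched as below. This is a minimal illustration under stated assumptions, not the Kubernetes implementation: `PodStatus`, `ResourceHealth`, `dropDisabledStatusFields`, and the boolean `gateEnabled` parameter are hypothetical stand-ins for the real API types and feature-gate check.

```go
package main

import "fmt"

// ResourceHealth is a hypothetical, simplified stand-in for the health
// entry that the KEP adds to pod status.
type ResourceHealth struct {
	ResourceID string
	Health     string
}

// PodStatus is a hypothetical, simplified stand-in for the real pod status.
type PodStatus struct {
	Phase          string
	ResourceHealth []ResourceHealth
}

// dropDisabledStatusFields clears the new field on an incoming update when
// the feature gate is disabled. With the gate enabled the status is left
// untouched. This is an assumed simplification of "fields dropped on update".
func dropDisabledStatusFields(status *PodStatus, gateEnabled bool) {
	if gateEnabled {
		return
	}
	status.ResourceHealth = nil
}

func main() {
	status := PodStatus{
		Phase:          "Running",
		ResourceHealth: []ResourceHealth{{ResourceID: "device-0", Health: "Unhealthy"}},
	}

	// Gate enabled: the field written by the kubelet is kept.
	dropDisabledStatusFields(&status, true)
	fmt.Println("entries with gate enabled:", len(status.ResourceHealth))

	// Gate disabled: the next update wipes the previously written value.
	dropDisabledStatusFields(&status, false)
	fmt.Println("entries with gate disabled:", len(status.ResourceHealth))
}
```

In this sketch a value wiped while the gate is off does not come back on its own after re-enablement; it reappears only once the writer populates it again, which matches the statement that values written before the feature was disabled are not guaranteed to be kept.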
@@ -348,7 +359,10 @@ No

 ###### What specific metrics should inform a rollback?

-N/A
+API server error rate increase. `apiserver_request_total` filtered by `code` to be non `2xx`.
+API validation error is the most likely indication of an error.
+
+Potential errors on kubelet would likely be exposed as error logs and events on Pods.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -378,7 +392,14 @@ N/A

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-N/A
+There are a few error modes for this feature:
+1. API issues accepting the new field - for example kubelet is writing the field in a format not acceptable by the API server
+2. kubelet fails while populating this field
+
+First error mode can be observer with the metric `apiserver_request_total` filtered by `code` to be non `2xx`.
+
+There is no good metric for the second error mode because it will not be clear what part of processing may fail.
+The most likely indication of an error would be the increased number of error events on the Pod.

 ### Dependencies

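The rollback signal named in the diff, `apiserver_request_total` filtered by `code` to be non `2xx`, amounts to a ratio of non-2xx requests to all requests. A minimal sketch of that arithmetic follows. It assumes the counter has already been scraped and reduced to per-code increases over some window; the `sample` type and `non2xxRatio` function are hypothetical helpers, not part of any Kubernetes or Prometheus client API.

```go
package main

import (
	"fmt"
	"strconv"
)

// sample is a hypothetical reduction of one apiserver_request_total series:
// the value of its `code` label and the counter increase over a time window.
type sample struct {
	Code  string
	Count float64
}

// non2xxRatio returns the share of requests whose code is outside 200-299.
// A code label that is not numeric is counted as non-2xx. An empty or
// all-zero input yields 0.
func non2xxRatio(samples []sample) float64 {
	var total, errs float64
	for _, s := range samples {
		total += s.Count
		code, err := strconv.Atoi(s.Code)
		if err != nil || code < 200 || code > 299 {
			errs += s.Count
		}
	}
	if total == 0 {
		return 0
	}
	return errs / total
}

func main() {
	// Illustrative numbers only, not measurements.
	samples := []sample{
		{Code: "200", Count: 90},
		{Code: "201", Count: 5},
		{Code: "422", Count: 4}, // e.g. a validation failure
		{Code: "500", Count: 1},
	}
	fmt.Printf("non-2xx ratio: %.2f\n", non2xxRatio(samples))
}
```

A sustained rise of this ratio after enabling the feature gate would be the kind of signal the PRR answer refers to. The threshold for acting on it is left open here because the source does not specify one.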