Commit be276fb

Merge pull request #5346 from guptaNswati/kep-3695-beta-update
kep-3695-beta update
2 parents 757aba6 + ae728ef commit be276fb

3 files changed

+44
-22
lines changed

keps/prod-readiness/sig-node/3695.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,3 +4,5 @@
44
kep-number: 3695
55
alpha:
66
approver: "@johnbelamaric"
7+
beta:
8+
approver: "@soltysh"

keps/sig-node/3695-pod-resources-for-dra/README.md

Lines changed: 37 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# KEP-3695: Extend the PodResources API to include resources allocated by DRA
1+
KEP-3695: Extend the PodResources API to include resources allocated by DRA
22

33
<!-- toc -->
44
- [Release Signoff Checklist](#release-signoff-checklist)
@@ -36,17 +36,17 @@
3636
Items marked with (R) are required *prior to targeting to a milestone / release*.
3737

3838
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
39-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
40-
- [ ] (R) Design details are appropriately documented
39+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
40+
- [x] (R) Design details are appropriately documented
4141
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
4242
- [ ] e2e Tests for all Beta API Operations (endpoints)
4343
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4444
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
45-
- [ ] (R) Graduation criteria is in place
45+
- [x] (R) Graduation criteria is in place
4646
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
47-
- [ ] (R) Production readiness review completed
47+
- [x] (R) Production readiness review completed
4848
- [ ] (R) Production readiness review approved
49-
- [ ] "Implementation History" section is up-to-date for milestone
49+
- [x] "Implementation History" section is up-to-date for milestone
5050
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
5151
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
5252

@@ -107,7 +107,7 @@ to allow querying specific pods for their allocated resources.
107107
returns the list of PodResources for *all* pods across *all* namespaces in the
108108
cluster). That is, it allows one to specify a specific pod and namespace to
109109
retrieve PodResources from, rather than having to query all of them all at
110-
once.
110+
once. `Get()` returns an error if the pod is known to the kubelet but has terminated.
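
The `Get()` behavior described above can be illustrated with a minimal sketch. Everything here is a simplified, hypothetical stand-in: `Server`, `PodPhase`, `podKey`, and the two sentinel errors are not the kubelet's real types or error values, and returning "not found" for an unknown pod is an assumption rather than something the KEP text states.

```go
package main

import (
	"errors"
	"fmt"
)

// PodPhase is a hypothetical, reduced view of the pod state the kubelet tracks.
type PodPhase int

const (
	PodRunning PodPhase = iota
	PodTerminated
)

// PodResources is a reduced, hypothetical mirror of the API response type.
type PodResources struct {
	Name      string
	Namespace string
}

// Hypothetical sentinel errors; the real API reports errors differently.
var (
	ErrPodNotFound   = errors.New("pod not found")
	ErrPodTerminated = errors.New("pod is terminated")
)

type podKey struct{ namespace, name string }

// Server holds an in-memory view of pods keyed by namespace and name.
type Server struct {
	pods map[podKey]PodPhase
}

// Get returns the resources of a single pod. A pod that is known but
// terminated yields an error instead of a response.
func (s *Server) Get(namespace, name string) (*PodResources, error) {
	phase, ok := s.pods[podKey{namespace, name}]
	if !ok {
		return nil, ErrPodNotFound // assumption: unknown pods are an error too
	}
	if phase == PodTerminated {
		return nil, ErrPodTerminated
	}
	return &PodResources{Name: name, Namespace: namespace}, nil
}

func main() {
	s := &Server{pods: map[podKey]PodPhase{
		{"default", "gpu-job"}: PodRunning,
		{"default", "old-job"}: PodTerminated,
	}}
	if _, err := s.Get("default", "old-job"); err != nil {
		fmt.Println("Get failed:", err)
	}
}
```

A consumer that polls a specific pod would therefore need to handle the terminated case explicitly rather than treating every error as a transport failure.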
111111

112112
The full PodResources API (including our proposed extensions) can be seen below:
113113

@@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests:
274274

275275
#### Beta
276276

277-
- [ ] Gather feedback from consumers of the DRA feature.
278-
- [ ] No major bugs reported in the previous cycle.
277+
- [x] Gather feedback from consumers of the DRA feature.
278+
- Integration with the NVIDIA DCGM exporter (https://github.com/NVIDIA/dcgm-exporter/pull/501) to gather per-pod Dynamic Resources managed by [k8s-dra-driver-gpu](https://github.com/NVIDIA/k8s-dra-driver-gpu).
279+
- [x] No major bugs reported in the previous cycle.
279280

280281
#### GA
281282

@@ -333,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed,
333334

334335
###### Are there any tests for feature enablement/disablement?
335336

336-
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
337+
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846)
337338

338339
### Rollout, Upgrade and Rollback Planning
339340

@@ -347,7 +348,12 @@ Kubelet may fail to start. The new API may report inconsistent data, or may caus
347348

348349
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
349350

350-
Not Applicable.
351+
Not Applicable. Because this change:
352+
353+
- Is read-only in the kubelet’s in-memory state.
354+
- Is behind a feature gate, so turning it off simply disables the new endpoints without affecting any existing behavior.
355+
356+
In practice, restart the kubelet with the gate disabled (rollback) or re-enabled (upgrade), and the API behavior reverts or returns without loss of data or consistency. Therefore we don’t need a special upgrade/downgrade test matrix for this KEP.
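
The rollback argument above can be modeled with a toy sketch: the endpoint only reads existing state, so toggling the gate changes what the API returns without changing the data behind it. The `Kubelet` type, its fields, `Restart`, and `ErrFeatureDisabled` are all hypothetical names used for illustration and do not correspond to real kubelet code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrFeatureDisabled is a hypothetical stand-in for the well-known error
// returned while the feature gate is off.
var ErrFeatureDisabled = errors.New("feature gate disabled")

// Kubelet is a toy model: allocations stands in for state the kubelet
// already maintains; gateEnabled stands in for the feature gate.
type Kubelet struct {
	gateEnabled bool
	allocations map[string][]string // pod name -> claim names
}

// Get is read-only: it never writes to allocations.
func (k *Kubelet) Get(pod string) ([]string, error) {
	if !k.gateEnabled {
		return nil, ErrFeatureDisabled
	}
	return k.allocations[pod], nil
}

// Restart models restarting the kubelet with a different gate value.
// The underlying allocation data is untouched.
func (k *Kubelet) Restart(gateEnabled bool) {
	k.gateEnabled = gateEnabled
}

func main() {
	k := &Kubelet{
		gateEnabled: true,
		allocations: map[string][]string{"gpu-job": {"claim-a"}},
	}

	k.Restart(false) // rollback: endpoint disabled
	_, err := k.Get("gpu-job")
	fmt.Println("after rollback:", err)

	k.Restart(true) // upgrade: endpoint returns
	claims, err := k.Get("gpu-job")
	fmt.Println("after upgrade:", claims, err)
}
```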
351357

352358
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
353359

@@ -372,7 +378,9 @@ Call the PodResources API and see the result.
372378

373379
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
374380

375-
N/A.
381+
100% in normal operation. The proposed API exposes, in read-only mode, kubelet-internal data that is critical to the functioning of the kubelet.
382+
The kubelet itself needs this data at all times, so the API is expected to be available 100% of the time.
383+
The only possible error source is the API calls being throttled by the rate-limiting introduced with the GA graduation of the parent KEP 606.
376384

377385
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
378386

@@ -408,36 +416,48 @@ No.
408416

409417
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
410418

411-
No.
419+
No. Enabling this feature does not change the number of API objects returned. It may, however, increase the size of each object when there are Dynamic Resources to report, because each ContainerResources then carries an extra dynamic_resources field.
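
A rough sketch of the size effect described above: the object count stays the same, but an object grows when `dynamic_resources` is populated. The structs below are a reduced, hypothetical mirror of the response types, and JSON is used only as a convenient stand-in for the real wire encoding; the fields inside `DynamicResource` are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DynamicResource is a hypothetical, reduced entry describing one claim.
type DynamicResource struct {
	ClaimName      string `json:"claim_name"`
	ClaimNamespace string `json:"claim_namespace"`
}

// ContainerResources is a reduced mirror of the per-container response.
// Empty slices are omitted, so a container without Dynamic Resources
// encodes exactly as it did before the field existed.
type ContainerResources struct {
	Name             string            `json:"name"`
	CPUIDs           []int64           `json:"cpu_ids,omitempty"`
	DynamicResources []DynamicResource `json:"dynamic_resources,omitempty"`
}

// encodedSize returns the JSON-encoded size in bytes, or -1 on error.
func encodedSize(c ContainerResources) int {
	b, err := json.Marshal(c)
	if err != nil {
		return -1
	}
	return len(b)
}

func main() {
	plain := ContainerResources{Name: "app", CPUIDs: []int64{0, 1}}

	withDRA := plain
	withDRA.DynamicResources = []DynamicResource{
		{ClaimName: "gpu-claim", ClaimNamespace: "default"},
	}

	fmt.Println("without dynamic resources:", encodedSize(plain), "bytes")
	fmt.Println("with dynamic resources:   ", encodedSize(withDRA), "bytes")
}
```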
412420

413421
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
414422

415423
No. The feature is out of any existing paths in the kubelet.
416424

417425
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
426+
Negligible amount of CPU and memory. Because the API is purely read-only and piggy-backs on the kubelet’s existing cache and checkpointing machinery, exposing Dynamic Resources incurs only minimal serialization and storage overhead, similar to that of CPUManager and DeviceManager, so any extra CPU, memory, disk, or I/O impact is negligible.
427+
428+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
418429

419-
DDOSing the API can lead to resource exhaustion.
430+
No, because the endpoint queries existing data structures inside the kubelet.
420431

421432
### Troubleshooting
422433

423434
###### How does this feature react if the API server and/or etcd is unavailable?
424435

425-
N/A.
436+
No impact, the feature is node-local.
426437

427438
###### What are other known failure modes?
428439

429-
The API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may cause the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API.
440+
Feature gate disabled: the API always returns a well-known error. In normal operation, the API is expected to never return an error and to always return a valid response, because it uses internal kubelet data that is always available.
441+
Bugs may cause the API to return unexpected errors, or to return inconsistent data.
442+
Consumers of the API should treat unexpected errors as bugs of this API.
430443

431444
###### What steps should be taken if SLOs are not being met to determine the problem?
432445

433-
N/A.
446+
Check the error code to learn whether the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606.
447+
Check the kubelet logs to learn about resource allocation errors.
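
On the consumer side, the troubleshooting steps above amount to telling throttling apart from other errors. The sketch below assumes a hypothetical `ErrThrottled` sentinel and a hypothetical `callWithBackoff` helper; the actual error code returned by the kubelet's rate limiter is not specified here and should be checked against the parent KEP 606.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrThrottled is a hypothetical stand-in for the rate-limiting error.
var ErrThrottled = errors.New("throttled by rate limiting")

// callWithBackoff retries call while it reports throttling, doubling the
// delay (in milliseconds) between attempts. Any other result, success or
// failure, is returned immediately so it can be investigated, for example
// in the kubelet logs. attempts must be at least 1.
// It returns the number of attempts made and the last error.
func callWithBackoff(call func() error, attempts, baseMs int, sleep func(ms int)) (int, error) {
	var err error
	delay := baseMs
	for i := 1; i <= attempts; i++ {
		err = call()
		if !errors.Is(err, ErrThrottled) {
			return i, err
		}
		if i < attempts {
			sleep(delay)
			delay *= 2
		}
	}
	return attempts, err
}

func main() {
	calls := 0
	// Fake API call: throttled on the first attempt, succeeds afterwards.
	fakeList := func() error {
		calls++
		if calls == 1 {
			return ErrThrottled
		}
		return nil
	}

	realSleep := func(ms int) { time.Sleep(time.Duration(ms) * time.Millisecond) }

	n, err := callWithBackoff(fakeList, 3, 10, realSleep)
	fmt.Println("attempts:", n, "error:", err)
}
```

Injecting the `sleep` function keeps the retry logic testable without real delays.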
434448

435449
## Implementation History
436450

437451
- 2023-01-12: KEP created
438452

439453
- 2024-09-10: KEP Updated to reflect the current state of the implementation.
440454

455+
- 2025-05-27: Beta version of the KEP.
456+
441457
## Drawbacks
442458

459+
N/A
460+
443461
## Alternatives
462+
463+
N/A

keps/sig-node/3695-pod-resources-for-dra/kep.yaml

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,8 @@ authors:
44
- "@moshe010"
55
owning-sig: sig-node
66
participating-sigs: []
7-
status: provisional
8-
creation-date: implementable
7+
status: implementable
8+
creation-date: 2023-02-07
99
reviewers:
1010
- "@ffromani"
1111
- "@swatisehgal"
@@ -18,17 +18,17 @@ see-also:
1818
replaces: []
1919

2020
# The target maturity stage in the current dev cycle for this KEP.
21-
stage: alpha
21+
stage: beta
2222

2323
# The most recent milestone for which work toward delivery of this KEP has been
2424
# done. This can be the current (upcoming) milestone, if it is being actively
2525
# worked on.
26-
latest-milestone: "v1.27"
26+
latest-milestone: "v1.34"
2727

2828
# The milestone at which this feature was, or is targeted to be, at each stage.
2929
milestone:
3030
alpha: "v1.27"
31-
beta: "v1.33"
31+
beta: "v1.34"
3232
stable: "v1.36"
3333

3434
# The following PRR answers are required at alpha release
