
Commit 49db474

committed
5018-update beta
Signed-off-by: Rita Zhang <[email protected]>
1 parent 4c66da9 commit 49db474

File tree

3 files changed

+51
-16
lines changed

keps/prod-readiness/sig-auth/5018.yaml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4,3 +4,5 @@
44
kep-number: 5018
55
alpha:
66
approver: "soltysh"
7+
beta:
8+
approver: "soltysh"

keps/sig-auth/5018-dra-adminaccess/README.md

Lines changed: 47 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -466,7 +466,7 @@ ResourceClaimTemplate and ResourceClaim for admin access
466466
467467
- Gather feedback
468468
- Additional tests are in Testgrid and linked in KEP
469-
- Implementations in the kubernetes-sigs/dra-example-driver
469+
- Implementations in the kubernetes-sigs/dra-example-driver (https://github.com/kubernetes-sigs/dra-example-driver/issues/97) and the NVIDIA DRA driver (https://github.com/NVIDIA/k8s-dra-driver-gpu/issues/337)
470470
471471
#### GA
472472
@@ -541,7 +541,12 @@ rollout. Similarly, consider large clusters and how enablement/disablement
541541
will rollout across nodes.
542542
-->
543543

544-
Will be considered for beta.
544+
- kube-controller-manager: If the kube-controller-manager fails to create `ResourceClaim` objects from `ResourceClaimTemplate` due to misconfigurations or permission issues relating to `adminAccess`, then the associated Pods will remain in a pending state and won't be scheduled.
545+
- kube-scheduler: Bugs in the scheduler might lead to Pods not being scheduled even when resources are available, or to Pods being scheduled despite unmet `adminAccess` requirements. If the `DRAAdminAccess` feature gate isn't enabled or is misconfigured, the scheduler might not recognize ResourceClaim requirements, leading to scheduling failures.
546+
- Workloads without `ResourceClaims` remain unaffected because the adminAccess feature doesn't interact with them. The new code paths introduced for adminAccess only engage when `ResourceClaims` are present in the Pod specification.
547+
- New Pods requiring `ResourceClaims` with `adminAccess` might remain unscheduled if the control plane components fail to process the claims correctly.
548+
- Existing Pods continue to run unaffected because the specs of `ResourceClaim` and `ResourceClaimTemplate`, including the adminAccess field, are immutable and cannot be altered.
549+
545550

546551
###### What specific metrics should inform a rollback?
547552

@@ -557,8 +562,6 @@ the `scheduler_pending_pods` metric in the kube-scheduler or an increase in the
557562
Further analysis by reviewing logs and pod events is needed to determine whether
558563
errors are related to this feature.
559564

560-
Will provide more details for beta.
561-
562565
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
563566

564567
<!--
@@ -567,15 +570,19 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
567570
are missing a bunch of machinery and tooling and can't do that now.
568571
-->
569572

570-
Will be considered for beta.
573+
This will be done manually before transition to beta by bringing up a cluster with kubeadm and changing the feature gate for individual components.
574+
575+
Manual upgrade of the control plane to a version with the feature enabled will be tested. Existing pods not using the feature should remain running. Creation of new pods and ResourceClaims that do not use the feature should be unaffected.
576+
577+
Manual downgrade of the control plane to a version with the feature disabled will be tested. Existing pods using the feature should remain running. Creation of new pods and ResourceClaims that use the feature should be blocked.
571578

572579
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
573580

574581
<!--
575582
Even if applying deprecation policies, they may still surprise some users.
576583
-->
577584

578-
Will be considered for beta.
585+
No.
579586

580587
### Monitoring Requirements
581588

@@ -586,7 +593,7 @@ For GA, this section is required: approvers should be able to confirm the
586593
previous answers based on experience in the field.
587594
-->
588595

589-
Will be considered for beta.
596+
The kube-controller-manager exposes metrics for the total number of ResourceClaims with adminAccess (resourceclaim_controller_resource_claims_adminaccess) and for the number of allocated ResourceClaims with adminAccess (resourceclaim_controller_allocated_resource_claims_adminaccess).
590597

591598
###### How can an operator determine if the feature is in use by workloads?
592599

@@ -596,7 +603,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
596603
logs or events for this purpose.
597604
-->
598605

599-
Will be considered for beta.
606+
".status.allocation.devices.results[*].adminAccess" will be set to true for a claim using adminAccess when needed by a pod.
607+
608+
The kube-controller-manager exposes metrics for the total number of ResourceClaims with adminAccess (resourceclaim_controller_resource_claims_adminaccess) and for the number of allocated ResourceClaims with adminAccess (resourceclaim_controller_allocated_resource_claims_adminaccess).
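
As a rough illustration of the status field mentioned above, an allocated claim using admin access might carry a status fragment like the following. This is a sketch only: the request, driver, pool, and device names are hypothetical placeholders and are not taken from the KEP.

```yaml
# Fragment of a ResourceClaim status (illustrative; all names are hypothetical).
status:
  allocation:
    devices:
      results:
        - request: admin-req            # hypothetical request name
          driver: example.com/driver    # hypothetical DRA driver name
          pool: node-a                  # hypothetical pool name
          device: device-0              # hypothetical device name
          adminAccess: true             # set when the claim was allocated with admin access
```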
600609

601610
###### How can someone using this feature know that it is working for their instance?
602611

@@ -640,7 +649,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
640649
question.
641650
-->
642651

643-
Will be considered for beta.
652+
SLO: 100% of unauthorized access attempts are denied.
644653

645654
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
646655

@@ -673,14 +682,17 @@ metric in scheduler will identify pods that are currently unschedulable because
673682
of the `DynamicResources` plugin or a misconfiguration of the `AdminAccess`
674683
field.
675684

685+
Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
686+
field.
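
A minimal sketch of such an audit policy, assuming the standard `audit.k8s.io/v1` Policy format and that ResourceClaim and ResourceClaimTemplate are served from the `resource.k8s.io` API group. It is an illustration of the idea, not a policy prescribed by the KEP.

```yaml
# Illustrative audit policy: log create operations on the relevant
# resources at the Metadata level.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    verbs: ["create"]
    resources:
      - group: "resource.k8s.io"
        resources: ["resourceclaims", "resourceclaimtemplates"]
      - group: ""                       # core API group
        resources: ["namespaces"]
```

The policy file would be passed to kube-apiserver through its `--audit-policy-file` flag. Any rules for other resources are left out of this sketch.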
687+
676688
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
677689

678690
<!--
679691
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
680692
implementation difficulties, etc.).
681693
-->
682694

683-
Will be considered for beta.
695+
No.
684696

685697
### Dependencies
686698

@@ -705,7 +717,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
705717
- Impact of its degraded performance or high-error rates on the feature:
706718
-->
707719

708-
Will be considered for beta.
720+
- The DynamicResourceAllocation feature gate must be enabled to create ResourceClaim and ResourceClaimTemplate objects. More details at [KEP-4381 - DRA Structured Parameters](https://github.com/kubernetes/enhancements/issues/4381).
721+
- A third-party DRA driver is required. The driver determines how to interpret the AdminAccess field to get access to device-specific resources without allocating them.
709722

710723
### Scalability
711724

@@ -755,7 +768,7 @@ details). For now, we leave it here.
755768

756769
###### How does this feature react if the API server and/or etcd is unavailable?
757770

758-
Will be considered for beta.
771+
The Kubernetes control plane will be down, so no new ResourceClaim or ResourceClaimTemplate will be created.
759772

760773
###### What are other known failure modes?
761774

@@ -772,15 +785,35 @@ For each of them, fill in the following information by copying the below templat
772785
- Testing: Are there any tests for failure mode? If not, describe why.
773786
-->
774787

775-
Will be considered for beta.
788+
- kube-scheduler cannot allocate ResourceClaims with AdminAccess.
789+
790+
- Detection: When pods fail to get scheduled, kube-scheduler reports that
791+
through events and pod status. For DRA, messages include "cannot allocate
792+
all claims" (insufficient resources) and "ResourceClaim not created yet"
793+
(user or kube-controller-manager haven't created the ResourceClaim yet).
794+
The
795+
["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
796+
metric will have pods counted under the "dynamicresources" plugin label.
797+
798+
To troubleshoot, "kubectl describe" can be used on (in this order) Pod
799+
and ResourceClaim.
800+
801+
- Mitigations: When ResourceClaims or ResourceClaimTemplates with the `AdminAccess`
802+
field don't get created, debugging should focus on the namespace labels. The kube-controller-manager logs should have more information.
803+
804+
- Diagnostics: Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
805+
field.
806+
807+
- Testing: E2E tests cover successful creation of ResourceClaims and ResourceClaimTemplates with the `AdminAccess` field in an admin namespace, and denied attempts in a non-admin namespace.
776808

777809
###### What steps should be taken if SLOs are not being met to determine the problem?
778810

779-
Will be considered for beta.
811+
If the SLO is not being met, some unauthorized access attempts are not being denied. To determine the problem, review the namespace labels and verify that they are correct.
780812

781813
## Implementation History
782814

783815
- Kubernetes 1.33: Alpha version of the KEP.
816+
- Kubernetes 1.34: Beta version of the KEP.
784817

785818
## Drawbacks
786819

keps/sig-auth/5018-dra-adminaccess/kep.yaml

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -17,12 +17,12 @@ see-also:
1717
- "/keps/sig-node/4381-dra-structured-parameters"
1818

1919
# The target maturity stage in the current dev cycle for this KEP.
20-
stage: alpha
20+
stage: beta
2121

2222
# The most recent milestone for which work toward delivery of this KEP has been
2323
# done. This can be the current (upcoming) milestone, if it is being actively
2424
# worked on.
25-
latest-milestone: "v1.33"
25+
latest-milestone: "v1.34"
2626

2727
# The milestone at which this feature was, or is targeted to be, at each stage.
2828
milestone:
