File: keps/sig-auth/5018-dra-adminaccess/README.md
Lines changed: 47 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -466,7 +466,7 @@ ResourceClaimTemplate and ResourceClaim for admin access
466
466
467
467
- Gather feedback
468
468
- Additional tests are in Testgrid and linked in KEP
469
-
- Implementations in the kubernetes-sigs/dra-example-driver
469
+
- Implementations in the kubernetes-sigs/dra-example-driver (https://github.com/kubernetes-sigs/dra-example-driver/issues/97) and the NVIDIA DRA driver (https://github.com/NVIDIA/k8s-dra-driver-gpu/issues/337)
470
470
471
471
#### GA
472
472
@@ -541,7 +541,12 @@ rollout. Similarly, consider large clusters and how enablement/disablement
541
541
will rollout across nodes.
542
542
-->
543
543
544
-
Will be considered for beta.
544
+
- kube-controller-manager: If the kube-controller-manager fails to create `ResourceClaim` objects from `ResourceClaimTemplate` due to misconfigurations or permission issues relating to `adminAccess`, then the associated Pods will remain in a pending state and won't be scheduled.
545
+
- kube-scheduler: Bugs in the scheduler might lead to Pods not being scheduled even when resources are available, or to Pods being scheduled that shouldn't be because of unmet `adminAccess` requirements. If the `DRAAdminAccess` feature gate isn't enabled or is misconfigured, the scheduler might not recognize `ResourceClaim` requirements, leading to scheduling failures.
546
+
- Workloads without `ResourceClaims` remain unaffected because the `adminAccess` feature doesn't interact with them. The new code paths introduced for `adminAccess` only engage when `ResourceClaims` are present in the Pod specification.
547
+
- New Pods requiring `ResourceClaims` with `adminAccess` might remain unscheduled if the control plane components fail to process the claims correctly.
548
+
- Existing Pods continue to run unaffected because the specs of `ResourceClaim` and `ResourceClaimTemplate`, including the `adminAccess` field, are immutable and cannot be altered.
549
+
545
550
546
551
###### What specific metrics should inform a rollback?
547
552
@@ -557,8 +562,6 @@ the `scheduler_pending_pods` metric in the kube-scheduler or an increase in the
557
562
Further analysis by reviewing logs and pod events is needed to determine whether
558
563
errors are related to this feature.
559
564
560
-
Will provide more details for beta.
561
-
562
565
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
563
566
564
567
<!--
@@ -567,15 +570,19 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
567
570
are missing a bunch of machinery and tooling and can't do that now.
568
571
-->
569
572
570
-
Will be considered for beta.
573
+
This will be done manually before transition to beta by bringing up a cluster with kubeadm and changing the feature gate for individual components.
574
+
575
+
Manual upgrade of the control plane to a version with the feature enabled will be tested. Existing pods not using the feature should remain running. Creation of new pods and ResourceClaims that do not use the feature should be unaffected.
576
+
577
+
Manual downgrade of the control plane to a version with the feature disabled was tested. Existing pods using the feature remained running. Creation of new pods and ResourceClaims that use the feature should be blocked.
571
578
572
579
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
573
580
574
581
<!--
575
582
Even if applying deprecation policies, they may still surprise some users.
576
583
-->
577
584
578
-
Will be considered for beta.
585
+
No.
579
586
580
587
### Monitoring Requirements
581
588
@@ -586,7 +593,7 @@ For GA, this section is required: approvers should be able to confirm the
586
593
previous answers based on experience in the field.
587
594
-->
588
595
589
-
Will be considered for beta.
596
+
kube-controller-manager has metrics for the total number of ResourceClaims with adminAccess (`resourceclaim_controller_resource_claims_adminaccess`) and for the number of allocated ResourceClaims with adminAccess (`resourceclaim_controller_allocated_resource_claims_adminaccess`).
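As a rough illustration of how these two metrics could be read, the sketch below sums their samples from a payload in the Prometheus text exposition format. The metric names come from the text above; the function name and the sample payload are hypothetical, and how the payload is fetched from kube-controller-manager is left out.

```python
# Metric names are taken from the KEP text; everything else is illustrative.
ADMIN_ACCESS_METRICS = (
    "resourceclaim_controller_resource_claims_adminaccess",
    "resourceclaim_controller_allocated_resource_claims_adminaccess",
)


def read_admin_access_metrics(metrics_text):
    """Sum the samples of each adminAccess metric in a Prometheus text payload."""
    totals = {name: 0.0 for name in ADMIN_ACCESS_METRICS}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        brace = line.find("{")
        if brace != -1:
            # "<name>{labels} <value>": the value follows the closing brace.
            name = line[:brace]
            rest = line[line.rfind("}") + 1:]
        else:
            # "<name> <value>"
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in totals and fields:
            totals[name] += float(fields[0])
    return totals


# Hypothetical payload, not real cluster output.
SAMPLE = """\
# TYPE resourceclaim_controller_resource_claims_adminaccess gauge
resourceclaim_controller_resource_claims_adminaccess 3
resourceclaim_controller_allocated_resource_claims_adminaccess 2
some_other_metric{label="x y"} 10
"""

if __name__ == "__main__":
    print(read_admin_access_metrics(SAMPLE))
```

A total above zero for either metric suggests that claims with adminAccess exist in the cluster.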
590
597
591
598
###### How can an operator determine if the feature is in use by workloads?
592
599
@@ -596,7 +603,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
596
603
logs or events for this purpose.
597
604
-->
598
605
599
-
Will be considered for beta.
606
+
`.status.allocation.devices.results[*].adminAccess` will be set to true for a claim using adminAccess when needed by a pod.
607
+
608
+
kube-controller-manager has metrics for the total number of ResourceClaims with adminAccess (`resourceclaim_controller_resource_claims_adminaccess`) and for the number of allocated ResourceClaims with adminAccess (`resourceclaim_controller_allocated_resource_claims_adminaccess`).
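The status-field check described above can be sketched as a scan over ResourceClaim objects that have already been decoded from JSON, for example the `items` list of `kubectl get resourceclaims -A -o json`. This is a minimal sketch: the function name is hypothetical, and the code assumes the JSON field is spelled `adminAccess`.

```python
def claims_using_admin_access(claims):
    """Return "namespace/name" for each claim with an adminAccess allocation result."""
    in_use = []
    for claim in claims:
        # Unallocated claims have no status.allocation, so default to empty values.
        allocation = (claim.get("status") or {}).get("allocation") or {}
        results = (allocation.get("devices") or {}).get("results") or []
        # Assumption: the allocation result field is serialized as "adminAccess".
        if any(result.get("adminAccess") is True for result in results):
            meta = claim.get("metadata") or {}
            in_use.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return in_use


# Hypothetical objects, trimmed to the fields the function reads.
EXAMPLE_CLAIMS = [
    {
        "metadata": {"namespace": "admin-ns", "name": "monitor"},
        "status": {"allocation": {"devices": {"results": [{"adminAccess": True}]}}},
    },
    {
        "metadata": {"namespace": "default", "name": "gpu"},
        "status": {"allocation": {"devices": {"results": [{"device": "gpu-0"}]}}},
    },
    {"metadata": {"namespace": "default", "name": "pending"}},
]

if __name__ == "__main__":
    print(claims_using_admin_access(EXAMPLE_CLAIMS))
```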
600
609
601
610
###### How can someone using this feature know that it is working for their instance?
602
611
@@ -640,7 +649,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
640
649
question.
641
650
-->
642
651
643
-
Will be considered for beta.
652
+
SLO: 100% of unauthorized access attempts are denied.
644
653
645
654
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
646
655
@@ -673,14 +682,17 @@ metric in scheduler will identify pods that are currently unschedulable because
673
682
of the `DynamicResources` plugin or a misconfiguration of the `AdminAccess`
674
683
field.
675
684
685
+
An audit policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the Metadata level to review successful and denied attempts to set the `AdminAccess`
686
+
field.
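Once such an audit policy is in place, the resulting log can be reviewed with a small script. The sketch below assumes a JSON-lines audit log and reads only standard audit event fields (`verb`, `stage`, `objectRef`, `responseStatus.code`). The function name and sample events are hypothetical. Because Metadata-level events carry no request body, the script cannot tell whether a rejected request set `AdminAccess`; it only separates successful creates from rejected ones.

```python
import json

WATCHED_RESOURCES = {"resourceclaims", "resourceclaimtemplates", "namespaces"}


def summarize_creates(audit_log_lines):
    """Split create events on the watched resources into succeeded and rejected."""
    summary = {"succeeded": [], "rejected": []}
    for raw in audit_log_lines:
        line = raw.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if event.get("verb") != "create" or ref.get("resource") not in WATCHED_RESOURCES:
            continue
        # Only the final stage of a request carries the response status.
        if event.get("stage") != "ResponseComplete":
            continue
        code = (event.get("responseStatus") or {}).get("code", 0)
        target = f"{ref.get('resource')}:{ref.get('namespace', '')}"
        # Any 4xx/5xx code is treated as a rejected attempt.
        summary["rejected" if code >= 400 else "succeeded"].append(target)
    return summary


# Hypothetical events, trimmed to the fields the function reads.
SAMPLE_LOG = [
    '{"verb": "create", "stage": "ResponseComplete", '
    '"objectRef": {"resource": "resourceclaims", "namespace": "admin-ns"}, '
    '"responseStatus": {"code": 201}}',
    '{"verb": "create", "stage": "ResponseComplete", '
    '"objectRef": {"resource": "resourceclaims", "namespace": "default"}, '
    '"responseStatus": {"code": 422}}',
    '{"verb": "get", "stage": "ResponseComplete", '
    '"objectRef": {"resource": "pods", "namespace": "default"}, '
    '"responseStatus": {"code": 200}}',
]

if __name__ == "__main__":
    print(summarize_creates(SAMPLE_LOG))
```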
687
+
676
688
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
677
689
678
690
<!--
679
691
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
680
692
implementation difficulties, etc.).
681
693
-->
682
694
683
-
Will be considered for beta.
695
+
No.
684
696
685
697
### Dependencies
686
698
@@ -705,7 +717,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
705
717
- Impact of its degraded performance or high-error rates on the feature:
706
718
-->
707
719
708
-
Will be considered for beta.
720
+
- The DynamicResourceAllocation feature gate must be enabled to create ResourceClaim and ResourceClaimTemplate objects. More details at [KEP-4381 - DRA Structured Parameters](https://github.com/kubernetes/enhancements/issues/4381).
721
+
- A third-party DRA driver is required. The driver decides how to interpret the `AdminAccess` field to get access to device-specific resources without allocating them.
709
722
710
723
### Scalability
711
724
@@ -755,7 +768,7 @@ details). For now, we leave it here.
755
768
756
769
###### How does this feature react if the API server and/or etcd is unavailable?
757
770
758
-
Will be considered for beta.
771
+
The Kubernetes control plane will be down, so no new ResourceClaim or ResourceClaimTemplate will be created.
759
772
760
773
###### What are other known failure modes?
761
774
@@ -772,15 +785,35 @@ For each of them, fill in the following information by copying the below templat
772
785
- Testing: Are there any tests for failure mode? If not, describe why.
773
786
-->
774
787
775
-
Will be considered for beta.
788
+
- kube-scheduler cannot allocate ResourceClaims with AdminAccess.
789
+
790
+
- Detection: When pods fail to get scheduled, kube-scheduler reports that
791
+
through events and pod status. For DRA, messages include "cannot allocate
792
+
all claims" (insufficient resources) and "ResourceClaim not created yet"
793
+
(user or kube-controller-manager haven't created the ResourceClaim yet).
metric will have pods counted under the "dynamicresources" plugin label.
797
+
798
+
To troubleshoot, "kubectl describe" can be used on (in this order) Pod
799
+
and ResourceClaim.
800
+
801
+
- Mitigations: When ResourceClaims or ResourceClaimTemplates with the `AdminAccess`
802
+
field don't get created, debugging should focus on the namespace labels. The kube-controller-manager logs should have more information.
803
+
804
+
- Diagnostics: An audit policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the Metadata level to review successful and denied attempts to set the `AdminAccess`
805
+
field.
806
+
807
+
- Testing: E2E tests cover successful creation of ResourceClaims and ResourceClaimTemplates with the `AdminAccess` field in an admin namespace and denied attempts in a non-admin namespace.
776
808
777
809
###### What steps should be taken if SLOs are not being met to determine the problem?
778
810
779
-
Will be considered for beta.
811
+
If the SLO is not being met, some unauthorized access attempts are not being denied. To determine the problem, review the namespace labels and verify that they are correct.
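A minimal sketch of the namespace label review, working on Namespace objects already decoded from JSON (for example the `items` list of `kubectl get namespaces -o json`). The label key and value below are assumptions that this section does not state; check them against the KEP's design details before relying on the result. The function name is hypothetical.

```python
# Assumption: admin namespaces are marked with this label key and value.
ADMIN_LABEL_KEY = "resource.kubernetes.io/admin-access"
ADMIN_LABEL_VALUE = "true"


def admin_namespaces(namespaces):
    """Return the sorted names of namespaces that carry the admin-access label."""
    return sorted(
        ns["metadata"]["name"]
        for ns in namespaces
        if (ns["metadata"].get("labels") or {}).get(ADMIN_LABEL_KEY) == ADMIN_LABEL_VALUE
    )


# Hypothetical objects, trimmed to the fields the function reads.
EXAMPLE_NAMESPACES = [
    {"metadata": {"name": "gpu-admin", "labels": {ADMIN_LABEL_KEY: "true"}}},
    {"metadata": {"name": "default", "labels": {"team": "web"}}},
    {"metadata": {"name": "typo", "labels": {ADMIN_LABEL_KEY: "True"}}},
    {"metadata": {"name": "unlabeled"}},
]

if __name__ == "__main__":
    print(admin_namespaces(EXAMPLE_NAMESPACES))
```

Any namespace in the output that should not hold admin claims is a candidate cause. The comparison is an exact string match, so a value such as `"True"` is not counted.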