@@ -518,7 +518,8 @@ Based on reviewers feedback describe what additional tests need to be added prio
518
518
implementing this enhancement to ensure the enhancements have also solid foundations.
519
519
-->
520
520
521
- TBD.
521
+ We assess that since the proposed controller resides out-of-tree, no
522
+ prerequisite testing suites need to be updated.
522
523
523
524
##### Unit tests
524
525
@@ -541,7 +542,11 @@ This can inform certain test coverage improvements that we want to do before
541
542
extending the production code to implement this enhancement.
542
543
-->
543
544
544
- TBD.
545
+ ` ` `
546
+ - `internal` : 20/1/2025 - 21.1% of statements
547
+ - `pkg/apis/resourcestatemetrics/v1alpha1` : 20/1/2025 - 18.3% of statements
548
+ - `pkg/resolver` : 20/1/2025 - 47.8% of statements
549
+ ` ` `
545
550
546
551
##### Integration tests
547
552
@@ -574,7 +579,13 @@ https://storage.googleapis.com/k8s-triage/index.html
574
579
We expect no non-infra related flakes in the last month as a GA graduation criteria.
575
580
-->
576
581
577
- - ` resourcestatemetrics_test`: https://github.com/rexagod/resource-state-metrics/tree/main/tests
582
+ End-to-end tests reside in the [` resourcestatemetrics_test`] package, and aim to test the following behaviors:
583
+ - conformance with the existing Custom Resource State API ([kubernetes/kube-state-metrics#1710]) feature-set,
584
+ - handling all supported events on the managed resource, which should exhibit deterministic outcomes for each of them,
585
+ - fuzzing the configuration with invalid values to ensure that the controller does not crash, and
586
+ - fuzzing the configuration with valid values to ensure there are no side effects.
587
+
588
+ [`resourcestatemetrics_test`] : https://github.com/rexagod/resource-state-metrics/tree/main/tests
578
589
579
590
# ## Graduation Criteria
580
591
@@ -654,7 +665,13 @@ enhancement:
654
665
cluster required to make on upgrade, in order to make use of the enhancement?
655
666
-->
656
667
657
- N/A.
668
+ ` ResourceMetricsMonitor` API(s) will follow the [hub-spoke] interconversion
669
+ model, and as such, an outdated hub or spoke version will be able to upgrade to
670
+ the latest spoke version when the controller itself is upgraded. This will in
671
+ turn convert the spoke to the hub version, and the hub version into the
672
+ latest available spoke version, depending on the controller's version.
673
+
674
+ [hub-spoke] : https://www.kubebuilder.io/multiversion-tutorial/conversion-concepts
658
675
659
676
# ## Version Skew Strategy
660
677
@@ -706,8 +723,6 @@ you need any help or guidance.
706
723
This section must be completed when targeting alpha to a release.
707
724
-->
708
725
709
- N/A.
710
-
711
726
# ##### How can this feature be enabled / disabled in a live cluster?
712
727
713
728
<!--
@@ -730,7 +745,16 @@ well as the [existing list] of feature gates.
730
745
of a node?
731
746
-->
732
747
733
- N/A.
748
+ - [X] Other
749
+ - Describe the mechanism : The controller can be enabled by deploying it in the
750
+ cluster, and disabled by deleting it. The managed resources may be left in
751
+ the cluster to be picked up by a future deployment. However, if that's not
752
+ intended, cluster-admins will need to drop them separately (either all CRs
753
+ or the defining CRD).
754
+ - Will enabling / disabling the feature require downtime of the control
755
+ plane? No.
756
+ - Will enabling / disabling the feature require downtime or reprovisioning
757
+ of a node? No.
734
758
735
759
# ##### Does enabling the feature change any default behavior?
736
760
@@ -756,17 +780,20 @@ feature.
756
780
NOTE : Also set `disable-supported` to `true` or `false` in `kep.yaml`.
757
781
-->
758
782
759
- Removing the controller will not remove its managed resources. They will be
760
- dropped if the corresponding `ResourceMetricsMonitor` CRD is deleted. Similarly,
761
- removing all managed resources will not remove the controller.
783
+ Yes, cluster-admins can roll back the controller by simply deleting it, as there
784
+ are no finalizers or owner references linking back to the controller (this may
785
+ be [subject to change] at a later stage). Consequently, doing so does not remove
786
+ the managed resources automatically, which will need to be dropped separately
787
+ (or by dropping the defining CRD).
762
788
763
- The reason why it is so is because cluster-scoped resources cannot have owner
789
+ The reason why it is so is that cluster-scoped resources cannot have owner
764
790
references linking back to namespace-scoped resources, in which case, the
765
791
garbage collection is a no-op. See [819a80] for more details.
766
792
793
+ [subject to change] : https://github.com/rexagod/resource-state-metrics#todo
767
794
[819a80] : https://github.com/rexagod/resource-state-metrics/commit/819a8001200a13c51cb82779c139a9081b4d613b
768
795
769
- # ##### What happens if we reenable the feature if it was previously rolled back?
796
+ # ##### What happens if we re-enable the feature if it was previously rolled back?
770
797
771
798
In the context of the controller, pre-existing managed resources in the cluster
772
799
will be picked back up after a (re-)deploy.
@@ -811,8 +838,11 @@ rollout. Similarly, consider large clusters and how enablement/disablement
811
838
will rollout across nodes.
812
839
-->
813
840
814
- In the context of the controller, failed controller deployments do not make any
815
- changes to the existing managed resources' metric configurations.
841
+ In the context of the controller, failed controller deployments simply do not
842
+ make any changes to the existing managed resources' metric configurations
843
+ (`spec.configuration`), and as such, the metrics will again be available once a
844
+ deployed instance is up and discovers all existing managed resources in the
845
+ cluster. But long as that's not healthy, the metrics exposition will be absent.
816
846
817
847
# ##### What specific metrics should inform a rollback?
818
848
@@ -821,7 +851,22 @@ What signals should users be paying attention to when the feature is young
821
851
that might indicate a serious problem?
822
852
-->
823
853
824
- The controller has a telemetry server to diagnose and query statistics.
854
+ At the moment any disruptions will be indicated by the managed resources' `status` field, like so,
855
+
856
+ ```
857
+ status:
858
+ conditions:
859
+ - lastTransitionTime: "2024-08-27T19:46:13Z"
860
+ message: 'Resource failed to process'
861
+ observedGeneration: 1
862
+ reason: EventHandlerFailed
863
+ status: True
864
+ type: Failed
865
+ ```
866
+
867
+ Additionally, diagnostic metrics and traces are [planned] for the near-future.
868
+
869
+ [planned]: https://github.com/rexagod/resource-state-metrics#todo
825
870
826
871
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
827
872
@@ -831,9 +876,7 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
831
876
are missing a bunch of machinery and tooling and can't do that now.
832
877
-->
833
878
834
- Yes, controller deploy, scale down, and scale up works fine. Operations on
835
- managed resources, if any, in the future, will be taken care of using the
836
- aforementioned [hub-spoke] interconversion model.
879
+ TBD (when targeting beta).
837
880
838
881
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
839
882
@@ -854,8 +897,6 @@ For GA, this section is required: approvers should be able to confirm the
854
897
previous answers based on experience in the field.
855
898
-->
856
899
857
- N/A.
858
-
859
900
###### How can an operator determine if the feature is in use by workloads?
860
901
861
902
<!--
@@ -864,7 +905,7 @@ checking if there are objects with field X set) may be a last resort. Avoid
864
905
logs or events for this purpose.
865
906
-->
866
907
867
- N/A .
908
+ This is not nota workload feature, but an out-of-tree telemetry solution .
868
909
869
910
###### How can someone using this feature know that it is working for their instance?
870
911
@@ -941,13 +982,7 @@ TBD.
941
982
This section must be completed when targeting beta to a release.
942
983
-->
943
984
944
- The controller __directly__ imports the following `k8s.io` packages :
945
- * `k8s.io/api`
946
- * `k8s.io/apimachinery`
947
- * `k8s.io/client-go`
948
- * `k8s.io/code-generator`
949
- * `k8s.io/klog/v2`
950
- * `k8s.io/utils`
985
+ TBD (when targeting beta).
951
986
952
987
###### Does this feature depend on any specific services running in the cluster?
953
988
@@ -966,7 +1001,18 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
966
1001
- Impact of its degraded performance or high-error rates on the feature:
967
1002
-->
968
1003
969
- The controller relies on core Kubernetes components.
1004
+ The controller relies on core Kubernetes serving-stack components:
1005
+ - ` kube-apiserver ` and ` etcd `
1006
+ - Usage description: The controller needs to make API calls in order to
1007
+ reconcile the managed resources in the cluster.
1008
+ - Impact of its outage on the feature: The controller will log errors but
1009
+ keep reconciling for when ` etcd ` comes back up.
1010
+ - Impact of its degraded performance or high-error rates on the feature: Same
1011
+ as above.
1012
+
1013
+ There's an indirect dependency on critical components such as
1014
+ ` kube-controller-manager ` and ` kubelet ` , which have been left out above for the
1015
+ sake of brevity.
970
1016
971
1017
### Scalability
972
1018
@@ -1036,8 +1082,8 @@ Describe them, providing:
1036
1082
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
1037
1083
-->
1038
1084
1039
- The controller will **not** create or modify any object on its own. Only when a
1040
- managed resource is deployed will it try to reconcile it .
1085
+ The controller will ** not** create any object on its own. Only when a managed
1086
+ resource is deployed will it try to reconcile (modify) it, if needed .
1041
1087
1042
1088
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
1043
1089
0 commit comments