Skip to content

Commit a1a9a7d

Browse files
committed
fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! KEP-4785: CRDMetrics Controller
1 parent 39b8965 commit a1a9a7d

File tree

2 files changed

+78
-32
lines changed
  • keps
    • prod-readiness/sig-instrumentation
    • sig-instrumentation/4785-resource-state-metrics

2 files changed

+78
-32
lines changed

keps/prod-readiness/sig-instrumentation/4785.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,4 @@
33
# of http://git.k8s.io/enhancements/OWNERS_ALIASES
44
kep-number: 4785
55
alpha:
6-
approver: "@johnbelamaric"
6+
approver: "@wojtek-t"

keps/sig-instrumentation/4785-resource-state-metrics/README.md

Lines changed: 77 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -518,7 +518,8 @@ Based on reviewers feedback describe what additional tests need to be added prio
518518
implementing this enhancement to ensure the enhancements have also solid foundations.
519519
-->
520520
521-
TBD.
521+
We assess that since the proposed controller resides out-of-tree, no
522+
prerequisite testing suites need to be updated.
522523
523524
##### Unit tests
524525
@@ -541,7 +542,11 @@ This can inform certain test coverage improvements that we want to do before
541542
extending the production code to implement this enhancement.
542543
-->
543544
544-
TBD.
545+
```
546+
- `internal`: 20/1/2025 - 21.1% of statements
547+
- `pkg/apis/resourcestatemetrics/v1alpha1`: 20/1/2025 - 18.3% of statements
548+
- `pkg/resolver`: 20/1/2025 - 47.8% of statements
549+
```
545550
546551
##### Integration tests
547552
@@ -574,7 +579,13 @@ https://storage.googleapis.com/k8s-triage/index.html
574579
We expect no non-infra related flakes in the last month as a GA graduation criteria.
575580
-->
576581
577-
- `resourcestatemetrics_test`: https://github.com/rexagod/resource-state-metrics/tree/main/tests
582+
End-to-end tests reside in the [`resourcestatemetrics_test`] package, and aim to test the following behaviors:
583+
- conformance with the existing Custom Resource State API ([kubernetes/kube-state-metrics#1710]) feature-set,
584+
- handling all supported events on the managed resource, which should exhibit deterministic outcomes for each of them,
585+
- fuzzing the configuration with invalid values to ensure that the controller does not crash, and
586+
- fuzzing the configuration with valid values to ensure there are no side effects.
587+
588+
[`resourcestatemetrics_test`]: https://github.com/rexagod/resource-state-metrics/tree/main/tests
578589

579590
### Graduation Criteria
580591

@@ -654,7 +665,13 @@ enhancement:
654665
cluster required to make on upgrade, in order to make use of the enhancement?
655666
-->
656667

657-
N/A.
668+
`ResourceMetricsMonitor` API(s) will follow the [hub-spoke] interconversion
669+
model, and as such, an outdated hub or spoke version will be able to upgrade to
670+
the latest spoke version when the controller itself is upgraded. This will in
671+
turn convert the spoke to the hub version, and the hub version into the
672+
latest available spoke version, depending on the controller's version.
673+
674+
[hub-spoke]: https://www.kubebuilder.io/multiversion-tutorial/conversion-concepts
658675

659676
### Version Skew Strategy
660677

@@ -706,8 +723,6 @@ you need any help or guidance.
706723
This section must be completed when targeting alpha to a release.
707724
-->
708725

709-
N/A.
710-
711726
###### How can this feature be enabled / disabled in a live cluster?
712727

713728
<!--
@@ -730,7 +745,16 @@ well as the [existing list] of feature gates.
730745
of a node?
731746
-->
732747

733-
N/A.
748+
- [X] Other
749+
- Describe the mechanism: The controller can be enabled by deploying it in the
750+
cluster, and disabled by deleting it. The managed resources may be left in
751+
the cluster to be picked up by a future deployment. However, if that's not
752+
intended, cluster-admins will need to drop them separately (either all CRs
753+
or the defining CRD).
754+
- Will enabling / disabling the feature require downtime of the control
755+
plane? No.
756+
- Will enabling / disabling the feature require downtime or reprovisioning
757+
of a node? No.
734758

735759
###### Does enabling the feature change any default behavior?
736760

@@ -756,17 +780,20 @@ feature.
756780
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
757781
-->
758782

759-
Removing the controller will not remove its managed resources. They will be
760-
dropped if the corresponding `ResourceMetricsMonitor` CRD is deleted. Similarly,
761-
removing all managed resources will not remove the controller.
783+
Yes, cluster-admins can roll back the controller by simply deleting it, as there
784+
are no finalizers or owner references linking back to the controller (this may
785+
be [subject to change] at a later stage). Consequently, doing so does not remove
786+
the managed resources automatically, which will need to be dropped separately
787+
(or by dropping the defining CRD).
762788

763-
The reason why it is so is because cluster-scoped resources cannot have owner
789+
The reason why it is so is that cluster-scoped resources cannot have owner
764790
references linking back to namespace-scoped resources, in which case, the
765791
garbage collection is a no-op. See [819a80] for more details.
766792

793+
[subject to change]: https://github.com/rexagod/resource-state-metrics#todo
767794
[819a80]: https://github.com/rexagod/resource-state-metrics/commit/819a8001200a13c51cb82779c139a9081b4d613b
768795

769-
###### What happens if we reenable the feature if it was previously rolled back?
796+
###### What happens if we re-enable the feature if it was previously rolled back?
770797

771798
In the context of the controller, pre-existing managed resources in the cluster
772799
will be picked back up after a (re-)deploy.
@@ -811,8 +838,11 @@ rollout. Similarly, consider large clusters and how enablement/disablement
811838
will rollout across nodes.
812839
-->
813840

814-
In the context of the controller, failed controller deployments do not make any
815-
changes to the existing managed resources' metric configurations.
841+
In the context of the controller, failed controller deployments simply do not
842+
make any changes to the existing managed resources' metric configurations
843+
(`spec.configuration`), and as such, the metrics will again be available once a
844+
deployed instance is up and discovers all existing managed resources in the
845+
cluster. But long as that's not healthy, the metrics exposition will be absent.
816846

817847
###### What specific metrics should inform a rollback?
818848

@@ -821,7 +851,22 @@ What signals should users be paying attention to when the feature is young
821851
that might indicate a serious problem?
822852
-->
823853

824-
The controller has a telemetry server to diagnose and query statistics.
854+
At the moment any disruptions will be indicated by the managed resources' `status` field, like so,
855+
856+
```
857+
status:
858+
conditions:
859+
- lastTransitionTime: "2024-08-27T19:46:13Z"
860+
message: 'Resource failed to process'
861+
observedGeneration: 1
862+
reason: EventHandlerFailed
863+
status: True
864+
type: Failed
865+
```
866+
867+
Additionally, diagnostic metrics and traces are [planned] for the near-future.
868+
869+
[planned]: https://github.com/rexagod/resource-state-metrics#todo
825870
826871
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
827872
@@ -831,9 +876,7 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
831876
are missing a bunch of machinery and tooling and can't do that now.
832877
-->
833878
834-
Yes, controller deploy, scale down, and scale up works fine. Operations on
835-
managed resources, if any, in the future, will be taken care of using the
836-
aforementioned [hub-spoke] interconversion model.
879+
TBD (when targeting beta).
837880
838881
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
839882
@@ -854,8 +897,6 @@ For GA, this section is required: approvers should be able to confirm the
854897
previous answers based on experience in the field.
855898
-->
856899
857-
N/A.
858-
859900
###### How can an operator determine if the feature is in use by workloads?
860901
861902
<!--
@@ -864,7 +905,7 @@ checking if there are objects with field X set) may be a last resort. Avoid
864905
logs or events for this purpose.
865906
-->
866907
867-
N/A.
908+
This is not nota workload feature, but an out-of-tree telemetry solution.
868909
869910
###### How can someone using this feature know that it is working for their instance?
870911
@@ -941,13 +982,7 @@ TBD.
941982
This section must be completed when targeting beta to a release.
942983
-->
943984

944-
The controller __directly__ imports the following `k8s.io` packages:
945-
* `k8s.io/api`
946-
* `k8s.io/apimachinery`
947-
* `k8s.io/client-go`
948-
* `k8s.io/code-generator`
949-
* `k8s.io/klog/v2`
950-
* `k8s.io/utils`
985+
TBD (when targeting beta).
951986

952987
###### Does this feature depend on any specific services running in the cluster?
953988

@@ -966,7 +1001,18 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
9661001
- Impact of its degraded performance or high-error rates on the feature:
9671002
-->
9681003

969-
The controller relies on core Kubernetes components.
1004+
The controller relies on core Kubernetes serving-stack components:
1005+
- `kube-apiserver` and `etcd`
1006+
- Usage description: The controller needs to make API calls in order to
1007+
reconcile the managed resources in the cluster.
1008+
- Impact of its outage on the feature: The controller will log errors but
1009+
keep reconciling for when `etcd` comes back up.
1010+
- Impact of its degraded performance or high-error rates on the feature: Same
1011+
as above.
1012+
1013+
There's an indirect dependency on critical components such as
1014+
`kube-controller-manager` and `kubelet`, which have been left out above for the
1015+
sake of brevity.
9701016

9711017
### Scalability
9721018

@@ -1036,8 +1082,8 @@ Describe them, providing:
10361082
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
10371083
-->
10381084

1039-
The controller will **not** create or modify any object on its own. Only when a
1040-
managed resource is deployed will it try to reconcile it.
1085+
The controller will **not** create any object on its own. Only when a managed
1086+
resource is deployed will it try to reconcile (modify) it, if needed.
10411087

10421088
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
10431089

0 commit comments

Comments
 (0)