
Conversation

knrc (Contributor) commented Aug 2, 2025

…lidations

Summary

This PR closes #22, adding support for status updates within the ModelValidation resource. The status tracks a number of items:

  • injectedPodCount (the number of pods associated with the current ModelValidation)
  • uninjectedPodCount (the number of pods labelled for the current ModelValidation, but not yet injected)
  • orphanedPodCount (the number of pods which were injected, but now orphaned because the ModelValidation configuration changed)
  • a list of pod information for each of the above

The PR introduces controllers for tracking pods and ModelValidations, with a separate status tracker that coordinates the tracking and updates the status. Status updates for the ModelValidation resources are debounced, with backoff and retries.

This PR also includes modifications to the CRD printer columns to include the counts, for example:

NAMESPACE             NAME             AUTH METHOD   INJECTED PODS   UNINJECTED PODS   ORPHANED PODS   AGE
e2e-webhook-test-ns   e2e-test-model                                                                   0s
e2e-webhook-test-ns   e2e-test-model   sigstore      1               0                 0               0s
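
Roughly, the new counters and printer columns could be declared along these lines; the field names follow the summary above, while the JSON paths, kubebuilder markers, and Go types are assumptions rather than the PR's actual code (PodInfo is the PR's per-pod detail type):

// ModelValidationStatus (sketch): the counts and pod lists described above.
type ModelValidationStatus struct {
	InjectedPodCount   int32 `json:"injectedPodCount,omitempty"`   // pods injected for the current configuration
	UninjectedPodCount int32 `json:"uninjectedPodCount,omitempty"` // pods labelled for this ModelValidation but not injected
	OrphanedPodCount   int32 `json:"orphanedPodCount,omitempty"`   // pods injected under a previous configuration
	InjectedPods       []PodInfo `json:"injectedPods,omitempty"`
	UninjectedPods     []PodInfo `json:"uninjectedPods,omitempty"`
	OrphanedPods       []PodInfo `json:"orphanedPods,omitempty"`
}

// Additional printer columns on the ModelValidation type, matching the output above:
// +kubebuilder:printcolumn:name="Injected Pods",type=integer,JSONPath=`.status.injectedPodCount`
// +kubebuilder:printcolumn:name="Uninjected Pods",type=integer,JSONPath=`.status.uninjectedPodCount`
// +kubebuilder:printcolumn:name="Orphaned Pods",type=integer,JSONPath=`.status.orphanedPodCount`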

Unit tests have been written, but end-to-end tests and metrics are still missing; those will be handled in separate PRs, as this one is already sizeable.

Release Note

Added status tracking for ModelValidation resources, tracking the pods associated with each resource. The status of each resource contains:

  • the number of pods associated with the current configuration
  • the number of pods associated with a previous configuration
  • the number of pods associated with the ModelValidation, but not injected for validation
  • lists of pods for each of the previous counts

Documentation

bouskaJ (Contributor) commented Aug 6, 2025

Hello. I did a quick review, as I haven't had much time to look into it further yet. One thing came to mind: you're implementing a retry function with backoff. Isn't it possible to use something from https://github.com/kubernetes/client-go/blob/master/util/retry/util.go instead?
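
For reference, the client-go helper being pointed at is typically used like this; the updateStatusWithRetry name, the v1alpha1 alias, and the controller-runtime client are assumptions used only for illustration:

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithRetry blocks until the status update succeeds or the default
// backoff is exhausted, retrying only when the API server reports a conflict.
func updateStatusWithRetry(ctx context.Context, c client.Client, mv *v1alpha1.ModelValidation) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch the latest version before each attempt so the update is not stale.
		latest := &v1alpha1.ModelValidation{}
		if err := c.Get(ctx, client.ObjectKeyFromObject(mv), latest); err != nil {
			return err
		}
		latest.Status = mv.Status
		return c.Status().Update(ctx, latest)
	})
}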

knrc (Contributor, Author) commented Aug 6, 2025

> Hello. I did a quick review, as I haven't had much time to look into it further yet. One thing came to mind: you're implementing a retry function with backoff. Isn't it possible to use something from https://github.com/kubernetes/client-go/blob/master/util/retry/util.go instead?

That's slightly different; it handles retries synchronously, so it would block any other processing taking place.

The intention of the current retry implementation was to do asynchronous retries with backoffs, but I've just realized that by pushing it back through the debouncer to merge with any other updates I'm losing that. I'll fix it and push up the changes shortly.

Thanks for the feedback

knrc force-pushed the controller_status branch from c7c1d2a to 719b1b6 on August 7, 2025 at 00:02
knrc (Contributor, Author) commented Aug 7, 2025

> The intention of the current retry implementation was to do asynchronous retries with backoffs, but I've just realized that by pushing it back through the debouncer to merge with any other updates I'm losing that. I'll fix it and push up the changes shortly.

@bouskaJ While fixing this I realized that the workqueue has built-in support for exponential backoff on failures, so I've rewritten the code to use that and added a couple of test cases just to make sure it's behaving the way I expect. Please take another look; you can check the differences here
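
For context, this is roughly the client-go workqueue facility being referred to; the base and cap delays here are assumptions, not the values used in the PR:

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newStatusQueue builds a rate-limited workqueue whose per-item retry delay
// grows exponentially between the two bounds on repeated failures.
func newStatusQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, 30*time.Second)
	return workqueue.NewRateLimitingQueue(limiter)
}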

}

// GetConfigHash returns a hash of the validation configuration for drift detection
func (mv *ModelValidation) GetConfigHash() string {
osmman commented Aug 7, 2025

Hi, instead of creating a custom hashing method for drift detection, it's best to use Kubernetes built-in fields. These are more robust and are the standard pattern for operators.

Use .metadata.generation and a status field like status.observedGeneration, or an ObservedGeneration field within a Condition struct. The reconciliation loop should follow this logic (see the sketch below):

  1. On each reconcile, compare .metadata.generation with status.observedGeneration.
  2. If the values don't match, it indicates the spec has been updated and a reconciliation is needed.
  3. After a successful reconciliation, update status.observedGeneration to equal .metadata.generation.

Using .metadata.resourceVersion is another option, but generation is typically preferred for tracking spec-only changes, as resourceVersion changes on every update, including status updates.
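
A minimal controller-runtime sketch of that pattern; the v1alpha1 alias and the Status.ObservedGeneration field are assumptions about this codebase, and the reconciler is assumed to embed the standard scaffolded client (imports as in the existing reconciler):

func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var mv v1alpha1.ModelValidation
	if err := r.Get(ctx, req.NamespacedName, &mv); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Step 1: compare the spec generation with the last generation acted on.
	if mv.Status.ObservedGeneration == mv.Generation {
		// Nothing in the spec has changed, so no reconciliation is needed.
		return ctrl.Result{}, nil
	}

	// ... reconcile the updated spec here ...

	// Step 3: record the generation that was just reconciled.
	mv.Status.ObservedGeneration = mv.Generation
	if err := r.Status().Update(ctx, &mv); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}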

knrc (Contributor, Author) replied:

The hashing is not just about detecting drift; it's also used for tracking within PodInfo so I know which specific configuration was applied to the pods. I have no problem adding the observed generation in addition to the hash, as a quick sanity check, but not replacing the hash. I wouldn't want to use resourceVersion, as that comes from etcd and represents every change to the resource.
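
Purely for illustration, such a hash is commonly derived by serialising the spec and digesting it; the PR's actual GetConfigHash may well differ:

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// GetConfigHash (sketch): hash the validation configuration so PodInfo can
// record exactly which configuration was applied to each pod.
func (mv *ModelValidation) GetConfigHash() string {
	raw, err := json.Marshal(mv.Spec)
	if err != nil {
		return ""
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}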

knrc (Contributor, Author) replied:

@osmman I've updated the code to add the generation check, so we don't need to recompute the hash if the generation hasn't changed. I also tidied up some older code that was still tracking resourceVersion but never using it.

Please take a look at the changes

modelValidationName, ok := pod.Labels[constants.ModelValidationLabel]
if !ok || modelValidationName == "" {
	// Try to remove the pod in case it was previously tracked but label was removed
	if err := r.Tracker.RemovePodByName(ctx, req.NamespacedName); err != nil {
osmman commented:

I think the ModelValidationFinalizer should be removed when a pod is no longer being tracked. This will prevent the pod from getting stuck in a terminating state.

knrc (Contributor, Author) replied:

I can add this, although the only time the pod will be stuck in the terminating phase is if the operator is not running. Pod deletes always check the finalizer; it doesn't matter whether the pod is in the tracker or not.
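
A minimal sketch of dropping the finalizer for a pod that is no longer tracked; the constants.ModelValidationFinalizer name and the PodReconciler receiver are assumptions:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizerIfPresent strips the model validation finalizer from an
// untracked pod so its deletion is never blocked by the operator.
func (r *PodReconciler) removeFinalizerIfPresent(ctx context.Context, pod *corev1.Pod) error {
	if !controllerutil.ContainsFinalizer(pod, constants.ModelValidationFinalizer) {
		return nil
	}
	controllerutil.RemoveFinalizer(pod, constants.ModelValidationFinalizer)
	return r.Update(ctx, pod)
}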

knrc (Contributor, Author) replied:

@osmman I've taken another look at this and I don't think this is a good idea, unless we change the way the finalizers are handled. At the moment the finalizer is injected via the webhook, so pairing with the pod deletion seems to be logically correct.

An alternative would be to move the finalizer to the status tracker, have it add the finalizer when it starts tracking and then remove it when no longer tracking.

One of the things we should do is add some dynamic capabilities to the operator, i.e. look at dynamically enabling/disabling the validation, so I suspect we will end up with the alternative approach in a later update if not now.

)

// StatusTrackerImpl tracks injected pods and namespaces for ModelValidation resources
type StatusTrackerImpl struct {
osmman commented:

I'm having a hard time following the logic of the StatusTracker.

Could you clarify how the state of the tracked objects is preserved or recreated? I'm concerned about what happens to this information when the operator pod is restarted (for example by an upgrade) or moved to a different node.

knrc (Contributor, Author) replied:

The status tracker relies on the two controllers to provide the information: one tracks the pods and the other tracks the ModelValidations. If the operator is restarted, the controllers will go through the List/Watch cycle again. I do have a seeding step for the pods, but this is not strictly necessary as the pod controller will eventually provide the same information.
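
A rough sketch of such a seeding step; the seedFromExistingPods and TrackPod names are hypothetical, while constants.ModelValidationLabel is the label used by the pod controller above:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// seedFromExistingPods lists every pod carrying the model validation label, in
// any namespace, and hands each one to the tracker on startup.
func (t *StatusTrackerImpl) seedFromExistingPods(ctx context.Context, c client.Client) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.HasLabels{constants.ModelValidationLabel}); err != nil {
		return err
	}
	for i := range pods.Items {
		t.TrackPod(ctx, &pods.Items[i])
	}
	return nil
}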

knrc force-pushed the controller_status branch 2 times, most recently from 83a783d to 53abc76 on August 8, 2025 at 02:40
knrc force-pushed the controller_status branch from 53abc76 to bd28f29 on August 8, 2025 at 15:14
miyunari (Member) left a comment

Thanks @knrc ! Maybe a general thing: personally I think it's really hard to review PRs of this size. Do you think we can split this into multiple? 😃

// +kubebuilder:rbac:groups=ml.sigstore.dev,resources=modelvalidations/status,verbs=update

// Reconcile handles ModelValidation events to track creation/updates/deletion
func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
miyunari (Member) commented:

Maybe I don't understand the logic, but wouldn't it make sense to trigger the model validation CR reconciler whenever a pod is created or deleted, and then update the status at the end of the reconciliation? This could be done by listing all pods with an inject label that belong to the CR.
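
For reference, one way to wire that suggestion with a recent controller-runtime is a pod watch that maps each labelled pod back to its ModelValidation; the v1alpha1 alias is an assumption, and the pod recount would then happen inside Reconcile:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// SetupWithManager (sketch): reconcile a ModelValidation whenever one of its
// labelled pods is created, updated, or deleted.
func (r *ModelValidationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.ModelValidation{}).
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				name, ok := obj.GetLabels()[constants.ModelValidationLabel]
				if !ok || name == "" {
					return nil
				}
				return []reconcile.Request{{
					NamespacedName: types.NamespacedName{Namespace: obj.GetNamespace(), Name: name},
				}}
			})).
		Complete(r)
}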

knrc (Contributor, Author) replied:

There are two controllers being used, one for pods and one for the model validation. Both of them trigger the status tracker, which handles the reconciliation of information and then updates the model validation status.
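
In other words, both reconcilers feed a shared tracker. A sketch of the surface they could share; TrackPod and TrackModelValidation are hypothetical names, while RemovePodByName appears in the pod controller above:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// StatusTracker (sketch): called by the pod and ModelValidation controllers;
// it debounces and applies the resulting status updates asynchronously.
type StatusTracker interface {
	TrackPod(ctx context.Context, pod *corev1.Pod) error
	RemovePodByName(ctx context.Context, name types.NamespacedName) error
	TrackModelValidation(ctx context.Context, mv *v1alpha1.ModelValidation) error
}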


// DebouncedQueue provides a queue with built-in debouncing functionality
// It encapsulates both debouncing logic and workqueue implementation
type DebouncedQueue interface {
miyunari (Member) commented:

Could you please explain why we need this queue? Is there some risk of data corruption when the operator gets restarted and we lose the state?

knrc (Contributor, Author) replied:

The queue has two purposes (see the sketch below):

  • debounce the operations, so we don't risk flooding the API server with requests
  • handle asynchronous retries, should there be any issues with updating the model validation resources

@osmman asked a similar question about restarts
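
A rough sketch of how a delaying, rate-limited workqueue can serve both purposes; the 500ms debounce window and the updateStatus callback are assumptions:

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Enqueue side: AddAfter delays the key, and the queue de-duplicates keys, so a
// burst of pod events for one ModelValidation collapses into a single update.
func enqueueStatusUpdate(q workqueue.RateLimitingInterface, key string) {
	q.AddAfter(key, 500*time.Millisecond)
}

// Worker side: a failed update is requeued with exponential backoff, while a
// successful one resets the backoff counter for that key.
func runStatusWorker(q workqueue.RateLimitingInterface, updateStatus func(key string) error) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := updateStatus(key); err != nil {
			q.AddRateLimited(key)
		} else {
			q.Forget(key)
		}
		q.Done(item)
	}
}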

knrc (Contributor, Author) commented Aug 11, 2025

> Thanks @knrc ! Maybe a general thing: personally I think it's really hard to review PRs of this size. Do you think we can split this into multiple? 😃

Yes, I feel the same way. One PR should cover one task, which this one does; since the operator was only a webhook prior to this, the size is unfortunately necessary.

Development

Successfully merging this pull request may close these issues.

Provide status fields for CR

4 participants