
Conversation

knrc (Contributor) commented Aug 2, 2025

…lidations

Summary

This PR closes #22, adding support for status updates within the ModelValidation resource. The status tracks a number of items:

  • injectedPodCount (the number of pods associated with the current ModelValidation)
  • uninjectedPodCount (the number of pods labelled for the current ModelValidation, but not yet injected)
  • orphanedPodCount (the number of pods which were injected, but now orphaned because the ModelValidation configuration changed)
  • a list of pod information for each of the above

The PR introduces controllers for tracking pods and ModelValidations, with a separate status tracker that coordinates the tracking and updates the status. Status updates for the ModelValidation resources are debounced, with backoff and retries.

This PR also includes modifications to the CRD printer columns to include the counts, for example:

NAMESPACE             NAME             AUTH METHOD   INJECTED PODS   UNINJECTED PODS   ORPHANED PODS   AGE
e2e-webhook-test-ns   e2e-test-model                                                                   0s
e2e-webhook-test-ns   e2e-test-model   sigstore      1               0                 0               0s
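
Roughly, the new counters and printer columns could be declared along these lines; the field names follow the summary above, while the JSON paths, kubebuilder markers, and Go types are assumptions rather than the PR's actual code (PodInfo is the PR's per-pod detail type):

// ModelValidationStatus (sketch): the counts and pod lists described above.
type ModelValidationStatus struct {
	InjectedPodCount   int32 `json:"injectedPodCount,omitempty"`   // pods injected for the current configuration
	UninjectedPodCount int32 `json:"uninjectedPodCount,omitempty"` // pods labelled for this ModelValidation but not injected
	OrphanedPodCount   int32 `json:"orphanedPodCount,omitempty"`   // pods injected under a previous configuration
	InjectedPods       []PodInfo `json:"injectedPods,omitempty"`
	UninjectedPods     []PodInfo `json:"uninjectedPods,omitempty"`
	OrphanedPods       []PodInfo `json:"orphanedPods,omitempty"`
}

// Additional printer columns on the ModelValidation type, matching the output above:
// +kubebuilder:printcolumn:name="Injected Pods",type=integer,JSONPath=`.status.injectedPodCount`
// +kubebuilder:printcolumn:name="Uninjected Pods",type=integer,JSONPath=`.status.uninjectedPodCount`
// +kubebuilder:printcolumn:name="Orphaned Pods",type=integer,JSONPath=`.status.orphanedPodCount`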

Unit tests have been written, but end-to-end tests and metrics are still missing; those will be handled in separate PRs, as this one is already sizeable.

Release Note

Added status tracking for ModelValidation resources, tracking the pods associated with each resource. The status of each resource contains:

  • the number of pods associated with the current configuration
  • the number of pods associated with a previous configuration
  • the number of pods associated with the ModelValidation, but not injected for validation
  • lists of pods for each of the previous counts

Documentation

bouskaJ (Contributor) commented Aug 6, 2025

Hello. I did a quick review, as I haven't had much time to look into it further yet. One thing came to mind: you're implementing a retry function with backoff. Isn't it possible to use something from https://github.com/kubernetes/client-go/blob/master/util/retry/util.go instead?
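
For reference, the client-go helper being pointed at is typically used like this; the updateStatusWithRetry name, the v1alpha1 alias, and the controller-runtime client are assumptions used only for illustration:

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithRetry blocks until the status update succeeds or the default
// backoff is exhausted, retrying only when the API server reports a conflict.
func updateStatusWithRetry(ctx context.Context, c client.Client, mv *v1alpha1.ModelValidation) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch the latest version before each attempt so the update is not stale.
		latest := &v1alpha1.ModelValidation{}
		if err := c.Get(ctx, client.ObjectKeyFromObject(mv), latest); err != nil {
			return err
		}
		latest.Status = mv.Status
		return c.Status().Update(ctx, latest)
	})
}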

knrc (Contributor, Author) commented Aug 6, 2025

> Hello. I did a quick review, as I haven't had much time to look into it further yet. One thing came to mind: you're implementing a retry function with backoff. Isn't it possible to use something from https://github.com/kubernetes/client-go/blob/master/util/retry/util.go instead?

That's slightly different; it handles retries synchronously, so it would block any other processing taking place.

The intention of the current retry implementation was to do asynchronous retries with backoffs, but I've just realized that by pushing it back through the debouncer to merge with any other updates I'm losing that. I'll fix it and push up the changes shortly.

Thanks for the feedback

knrc force-pushed the controller_status branch from c7c1d2a to 719b1b6 on August 7, 2025 at 00:02
knrc (Contributor, Author) commented Aug 7, 2025

> The intention of the current retry implementation was to do asynchronous retries with backoffs, but I've just realized that by pushing it back through the debouncer to merge with any other updates I'm losing that. I'll fix it and push up the changes shortly.

@bouskaJ While fixing this I realized that the workqueue has built-in support for exponential backoff on failures, so I've rewritten the code to use that and added a couple of test cases just to make sure it's behaving the way I expect. Please take another look; you can check the differences here
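
For context, this is roughly the client-go workqueue facility being referred to; the base and cap delays here are assumptions, not the values used in the PR:

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newStatusQueue builds a rate-limited workqueue whose per-item retry delay
// grows exponentially between the two bounds on repeated failures.
func newStatusQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, 30*time.Second)
	return workqueue.NewRateLimitingQueue(limiter)
}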

}

// GetConfigHash returns a hash of the validation configuration for drift detection
func (mv *ModelValidation) GetConfigHash() string {
osmman commented Aug 7, 2025

Hi, instead of creating a custom hashing method for drift detection, it's best to use Kubernetes built-in fields. These are more robust and are the standard pattern for operators.

Use .metadata.generation and a status field like status.observedGeneration, or an ObservedGeneration field within a Condition struct. The reconciliation loop should follow this logic (see the sketch below):

  1. On each reconcile, compare .metadata.generation with status.observedGeneration.
  2. If the values don't match, it indicates the spec has been updated and a reconciliation is needed.
  3. After a successful reconciliation, update status.observedGeneration to equal .metadata.generation.

Using .metadata.resourceVersion is another option, but generation is typically preferred for tracking spec-only changes, as resourceVersion changes on every update, including status updates.
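
A minimal controller-runtime sketch of that pattern; the v1alpha1 alias and the Status.ObservedGeneration field are assumptions about this codebase, and the reconciler is assumed to embed the standard scaffolded client (imports as in the existing reconciler):

func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var mv v1alpha1.ModelValidation
	if err := r.Get(ctx, req.NamespacedName, &mv); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Step 1: compare the spec generation with the last generation acted on.
	if mv.Status.ObservedGeneration == mv.Generation {
		// Nothing in the spec has changed, so no reconciliation is needed.
		return ctrl.Result{}, nil
	}

	// ... reconcile the updated spec here ...

	// Step 3: record the generation that was just reconciled.
	mv.Status.ObservedGeneration = mv.Generation
	if err := r.Status().Update(ctx, &mv); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}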

knrc (Contributor, Author) replied:

The hashing is not just about detecting drift; it's also used for tracking within PodInfo so I know which specific configuration was applied to the pods. I have no problem adding the observed generation in addition to the hash, as a quick sanity check, but not replacing the hash. I wouldn't want to use resourceVersion, as that comes from etcd and represents every change to the resource.
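
Purely for illustration, such a hash is commonly derived by serialising the spec and digesting it; the PR's actual GetConfigHash may well differ:

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// GetConfigHash (sketch): hash the validation configuration so PodInfo can
// record exactly which configuration was applied to each pod.
func (mv *ModelValidation) GetConfigHash() string {
	raw, err := json.Marshal(mv.Spec)
	if err != nil {
		return ""
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}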

knrc (Contributor, Author) replied:

@osmman I've updated the code to add the generation check, so we don't need to recompute the hash if the generation hasn't changed. I also tidied up some older code that was still tracking resourceVersion but never using it.

Please take a look at the changes

modelValidationName, ok := pod.Labels[constants.ModelValidationLabel]
if !ok || modelValidationName == "" {
	// Try to remove the pod in case it was previously tracked but label was removed
	if err := r.Tracker.RemovePodByName(ctx, req.NamespacedName); err != nil {
osmman commented:

I think the ModelValidationFinalizer should be removed when a pod is no longer being tracked. This will prevent the pod from getting stuck in a terminating state.

knrc (Contributor, Author) replied:

I can add this, although the only time the pod will be stuck in the terminating phase is if the operator is not running. Pod deletes always check the finalizer; it doesn't matter whether the pod is in the tracker or not.
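
A minimal sketch of dropping the finalizer for a pod that is no longer tracked; the constants.ModelValidationFinalizer name and the PodReconciler receiver are assumptions:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizerIfPresent strips the model validation finalizer from an
// untracked pod so its deletion is never blocked by the operator.
func (r *PodReconciler) removeFinalizerIfPresent(ctx context.Context, pod *corev1.Pod) error {
	if !controllerutil.ContainsFinalizer(pod, constants.ModelValidationFinalizer) {
		return nil
	}
	controllerutil.RemoveFinalizer(pod, constants.ModelValidationFinalizer)
	return r.Update(ctx, pod)
}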

knrc (Contributor, Author) replied:

@osmman I've taken another look at this and I don't think this is a good idea, unless we change the way the finalizers are handled. At the moment the finalizer is injected via the webhook, so pairing with the pod deletion seems to be logically correct.

An alternative would be to move the finalizer to the status tracker, have it add the finalizer when it starts tracking and then remove it when no longer tracking.

One of the things we should do is add some dynamic capabilities to the operator, i.e. look at dynamically enabling/disabling the validation, so I suspect we will end up with the alternative approach in a later update if not now.

)

// StatusTrackerImpl tracks injected pods and namespaces for ModelValidation resources
type StatusTrackerImpl struct {
osmman commented:

I'm having a hard time following the logic of the StatusTracker.

Could you clarify how the state of the tracked objects is preserved or recreated? I'm concerned about what happens to this information when the operator pod is restarted (for example by an upgrade) or moved to a different node.

knrc (Contributor, Author) replied:

The status tracker relies on the two controllers to provide the information: one tracks the pods and the other tracks the ModelValidations. If the operator is restarted, the controllers will go through the List/Watch cycle again. I do have a seeding step for the pods, but this is not strictly necessary as the pod controller will eventually provide the same information.
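
A rough sketch of such a seeding step; the seedFromExistingPods and TrackPod names are hypothetical, while constants.ModelValidationLabel is the label used by the pod controller above:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// seedFromExistingPods lists every pod carrying the model validation label, in
// any namespace, and hands each one to the tracker on startup.
func (t *StatusTrackerImpl) seedFromExistingPods(ctx context.Context, c client.Client) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.HasLabels{constants.ModelValidationLabel}); err != nil {
		return err
	}
	for i := range pods.Items {
		t.TrackPod(ctx, &pods.Items[i])
	}
	return nil
}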

knrc force-pushed the controller_status branch 2 times, most recently from 83a783d to 53abc76 on August 8, 2025 at 02:40
knrc force-pushed the controller_status branch from 53abc76 to bd28f29 on August 8, 2025 at 15:14
miyunari (Member) left a comment

Thanks @knrc ! Maybe a general thing: personally I think it's really hard to review PRs of this size. Do you think we can split this into multiple? 😃

// +kubebuilder:rbac:groups=ml.sigstore.dev,resources=modelvalidations/status,verbs=update

// Reconcile handles ModelValidation events to track creation/updates/deletion
func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
miyunari (Member) commented:

Maybe I don't understand the logic, but wouldn't it make sense to trigger the model validation CR reconciler whenever a pod is created or deleted, and then update the status at the end of the reconciliation? This could be done by listing all pods with an inject label that belong to the CR.
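
For reference, one way to wire that suggestion with a recent controller-runtime is a pod watch that maps each labelled pod back to its ModelValidation; the v1alpha1 alias is an assumption, and the pod recount would then happen inside Reconcile:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// SetupWithManager (sketch): reconcile a ModelValidation whenever one of its
// labelled pods is created, updated, or deleted.
func (r *ModelValidationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.ModelValidation{}).
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				name, ok := obj.GetLabels()[constants.ModelValidationLabel]
				if !ok || name == "" {
					return nil
				}
				return []reconcile.Request{{
					NamespacedName: types.NamespacedName{Namespace: obj.GetNamespace(), Name: name},
				}}
			})).
		Complete(r)
}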

knrc (Contributor, Author) replied:

There are two controllers being used, one for pods and one for the model validation. Both of them trigger the status tracker, which handles the reconciliation of information and then updates the model validation status.
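
In other words, both reconcilers feed a shared tracker. A sketch of the surface they could share; TrackPod and TrackModelValidation are hypothetical names, while RemovePodByName appears in the pod controller above:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// StatusTracker (sketch): called by the pod and ModelValidation controllers;
// it debounces and applies the resulting status updates asynchronously.
type StatusTracker interface {
	TrackPod(ctx context.Context, pod *corev1.Pod) error
	RemovePodByName(ctx context.Context, name types.NamespacedName) error
	TrackModelValidation(ctx context.Context, mv *v1alpha1.ModelValidation) error
}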


// DebouncedQueue provides a queue with built-in debouncing functionality
// It encapsulates both debouncing logic and workqueue implementation
type DebouncedQueue interface {
miyunari (Member) commented:

Could you please explain why we need this queue? Is there some risk of data corruption when the operator gets restarted and we lose the state?

knrc (Contributor, Author) replied:

The queue has two purposes (see the sketch below):

  • debounce the operations, so we don't risk flooding the API server with requests
  • handle asynchronous retries, should there be any issues with updating the model validation resources

@osmman asked a similar question about restarts
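
A rough sketch of how a delaying, rate-limited workqueue can serve both purposes; the 500ms debounce window and the updateStatus callback are assumptions:

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Enqueue side: AddAfter delays the key, and the queue de-duplicates keys, so a
// burst of pod events for one ModelValidation collapses into a single update.
func enqueueStatusUpdate(q workqueue.RateLimitingInterface, key string) {
	q.AddAfter(key, 500*time.Millisecond)
}

// Worker side: a failed update is requeued with exponential backoff, while a
// successful one resets the backoff counter for that key.
func runStatusWorker(q workqueue.RateLimitingInterface, updateStatus func(key string) error) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := updateStatus(key); err != nil {
			q.AddRateLimited(key)
		} else {
			q.Forget(key)
		}
		q.Done(item)
	}
}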

knrc (Contributor, Author) commented Aug 11, 2025

> Thanks @knrc ! Maybe a general thing: personally I think it's really hard to review PRs of this size. Do you think we can split this into multiple? 😃

Yes, I feel the same way. One PR should cover one task, which this one does; since the operator was only a webhook prior to this, the size is unfortunately necessary.

Development

Successfully merging this pull request may close these issues.

Provide status fields for CR

4 participants