feat: add status to the ModelValidation CRs, tracking pod and modelva… #47
base: main
Conversation
Hello. I did a quick review, as I didn't have much time to look into it further yet. One thing came to mind. You're implementing a retry function with backoff. Isn't it possible to use something from https://github.com/kubernetes/client-go/blob/master/util/retry/util.go instead?
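For reference, a minimal sketch of that client-go helper in use; the updateStatus function below is a stand-in for the operator's real update call, not code from this PR:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/retry"
)

// updateStatus stands in for the real status update call.
func updateStatus() error {
	return nil
}

func main() {
	// retry.DefaultBackoff retries a handful of times with exponential backoff,
	// but it blocks the caller while doing so.
	err := retry.OnError(retry.DefaultBackoff, func(err error) bool {
		return true // retry on any error in this sketch
	}, updateStatus)
	if err != nil {
		fmt.Println("update failed:", err)
	}
}
```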
That's slightly different; it's a synchronous way of handling the retries, so it would block any other processing taking place. The intention of the current retry implementation was to do asynchronous retries with backoffs, but I've just realized that by pushing it back through the debouncer to merge with any other updates I'm losing that. I'll fix it and push up the changes shortly. Thanks for the feedback.
@bouskaJ While fixing this I realized that the workqueue has an option for handling failures with exponential backoff, so I've rewritten the code to use that and added a couple of test cases just to make sure it's behaving the way I expect. Please take another look; you can check the differences here
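A rough sketch of the kind of rate-limited workqueue being described, assuming client-go's workqueue package; the key and process function are placeholders rather than the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// process stands in for updating a ModelValidation status.
func process(key string) error {
	return fmt.Errorf("simulated failure for %s", key)
}

func main() {
	// Exponential per-item backoff: 100ms base delay, capped at 30s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, 30*time.Second)
	queue := workqueue.NewRateLimitingQueue(limiter)
	defer queue.ShutDown()

	queue.Add("default/my-modelvalidation")

	for i := 0; i < 3; i++ {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := process(key); err != nil {
			// Re-queue with exponentially increasing delay instead of blocking.
			queue.AddRateLimited(key)
		} else {
			// Success: reset the failure count for this item.
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```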
}

// GetConfigHash returns a hash of the validation configuration for drift detection
func (mv *ModelValidation) GetConfigHash() string {
Hi, instead of creating a custom hashing method for drift detection, it's best to use Kubernetes' built-in fields. These are more robust and are the standard pattern for operators. Use .metadata.generation together with a status field such as status.observedGeneration (or an ObservedGeneration field within a Condition struct). The reconciliation loop should follow this logic:
- On each reconcile, compare .metadata.generation with status.observedGeneration.
- If the values don't match, the spec has been updated and a reconciliation is needed.
- After a successful reconciliation, update status.observedGeneration to equal .metadata.generation.

Using .metadata.resourceVersion is another option, but generation is typically preferred for tracking spec-only changes, as resourceVersion changes on every update, including status updates.
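A minimal sketch of the pattern being suggested, assuming a controller-runtime reconciler and an ObservedGeneration field on the status; the types shown are illustrative stand-ins for the operator's own:

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ModelValidationReconciler is sketched with just the embedded client;
// ModelValidation is the operator's CRD type, imported from its API package.
type ModelValidationReconciler struct {
	client.Client
}

func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var mv ModelValidation
	if err := r.Get(ctx, req.NamespacedName, &mv); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Spec unchanged since the last successful reconcile: nothing to redo.
	if mv.Status.ObservedGeneration == mv.Generation {
		return ctrl.Result{}, nil
	}

	// ... reconcile the updated spec here ...

	// Record which generation was just reconciled.
	mv.Status.ObservedGeneration = mv.Generation
	if err := r.Status().Update(ctx, &mv); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```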
The hashing is not just about detecting drift, it's also used for tracking within PodInfo so I know which specific configuration was applied to the pods. I have no problem adding the observed generation in addition to the hash, as a quick sanity check, but not replacing the hash. I wouldn't want to use resourceVersion, as that comes from etcd and represents every change to the resource.
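For illustration, one way such a configuration hash could be computed; this is a hedged sketch, not necessarily the PR's implementation:

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// configHash returns a stable identifier for a given spec so that each pod can
// record exactly which configuration was in effect when it was admitted.
func configHash(spec any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}
```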
@osmman I've updated the code to add the generation check, so we don't need to redo the hash if that hasn't changed. I also tidied up some older code; it was still tracking resourceVersion but I never used it. Please take a look at the changes.
modelValidationName, ok := pod.Labels[constants.ModelValidationLabel]
if !ok || modelValidationName == "" {
	// Try to remove the pod in case it was previously tracked but label was removed
	if err := r.Tracker.RemovePodByName(ctx, req.NamespacedName); err != nil {
I think the ModelValidationFinalizer should be removed when a pod is no longer being tracked. This will prevent the pod from getting stuck in a terminating state.
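A minimal sketch of the suggested cleanup, assuming controller-runtime's controllerutil helpers; the function name and the way the finalizer constant is passed in are illustrative:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizerIfUntracked drops the finalizer from a pod that the operator
// no longer tracks, so the pod can complete termination on its own.
func removeFinalizerIfUntracked(ctx context.Context, c client.Client, pod *corev1.Pod, finalizer string) error {
	if !controllerutil.ContainsFinalizer(pod, finalizer) {
		return nil
	}
	controllerutil.RemoveFinalizer(pod, finalizer)
	// Persist the change so the deletion can proceed.
	return c.Update(ctx, pod)
}
```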
I can add this, although the only time the pod will be stuck in the terminating phase is if the operator is not running. Pod deletes always check the finalizer, it doesn't matter whether it's in the tracker or not.
@osmman I've taken another look at this and I don't think this is a good idea, unless we change the way the finalizers are handled. At the moment the finalizer is injected via the webhook, so pairing with the pod deletion seems to be logically correct.
An alternative would be to move the finalizer to the status tracker, have it add the finalizer when it starts tracking and then remove it when no longer tracking.
One of the things we should do is add some dynamic capabilities to the operator, i.e. look at dynamically enabling/disabling the validation, so I suspect we will end up with the alternative approach in a later update if not now.
)

// StatusTrackerImpl tracks injected pods and namespaces for ModelValidation resources
type StatusTrackerImpl struct {
I'm having a hard time following the logic of the StatusTracker. Could you clarify how the state of the tracked objects is preserved or recreated? I'm concerned about what happens to this information when the operator pod is restarted (for example during an upgrade) or moved to a different node.
The status tracker relies on the two controllers to provide the information, one is for tracking the pods and the other for tracking the ModelValidations. If the operator is restarted then the controllers will go through the List/Watch cycle. I do have a seeding part for the pods, but this is not strictly necessary as the pod controller will eventually provide the same information.
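For illustration, the seeding mentioned above could look roughly like this; the label parameter stands in for constants.ModelValidationLabel from the PR, and the addPod callback for the tracker's own API:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// seedTrackedPods rebuilds the in-memory view after a restart by listing every
// pod that carries the validation label and handing it back to the tracker.
func seedTrackedPods(ctx context.Context, c client.Client, label string, addPod func(*corev1.Pod)) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.HasLabels{label}); err != nil {
		return err
	}
	for i := range pods.Items {
		addPod(&pods.Items[i])
	}
	return nil
}
```

Even without this seeding, the pod controller's List/Watch on startup would eventually deliver the same information, as noted above.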
Force-pushed from 83a783d to 53abc76
Thanks @knrc! Maybe a general thing: personally I think it's really hard to review PRs of this size. Do you think we can split this into multiple? 😃
// +kubebuilder:rbac:groups=ml.sigstore.dev,resources=modelvalidations/status,verbs=update

// Reconcile handles ModelValidation events to track creation/updates/deletion
func (r *ModelValidationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
Maybe I don't understand the logic, but wouldn't it make sense to trigger the model validation CR reconciler whenever a pod is created or deleted, and then update the status at the end of the reconciliation? This could be done by listing all pods with an inject label that belong to the CR.
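A hedged sketch of the listing step described in this suggestion; the label key is assumed to carry the ModelValidation name, as the webhook-injected label does elsewhere in this PR, and deriving the status from the result is omitted:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listValidatedPods returns the pods in the CR's namespace whose validation
// label points at the given ModelValidation name.
func listValidatedPods(ctx context.Context, c client.Client, label, namespace, mvName string) (*corev1.PodList, error) {
	var pods corev1.PodList
	err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{label: mvName},
	)
	return &pods, err
}
```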
There are two controllers being used, one for pods and one for the model validation. Both of them trigger the status tracker, which handles the reconciliation of information and then updates the ModelValidation status.
// DebouncedQueue provides a queue with built-in debouncing functionality
// It encapsulates both debouncing logic and workqueue implementation
type DebouncedQueue interface {
Could you please explain why we need this queue, and whether there is some risk of data corruption if the operator gets restarted and we lose the state?
The queue has two purposes:
- debounce the operations, so we don't risk flooding the API server with requests
- handle asynchronous retries, should there be any issues with updating the ModelValidation resources

@osmman asked a similar question about restarts
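For illustration, a debounced queue along these lines could be built on client-go's rate-limited workqueue; this is a sketch of the idea, not the PR's DebouncedQueue implementation:

```go
import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

type debouncedQueue struct {
	delay time.Duration
	queue workqueue.RateLimitingInterface
}

func newDebouncedQueue(delay time.Duration) *debouncedQueue {
	return &debouncedQueue{
		delay: delay,
		queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
	}
}

// Enqueue schedules the key after the debounce delay; the underlying workqueue
// de-duplicates keys that are already waiting, so a burst of updates typically
// results in a single status write.
func (d *debouncedQueue) Enqueue(key string) {
	d.queue.AddAfter(key, d.delay)
}

// Retry re-queues a key with exponential backoff after a failed update.
func (d *debouncedQueue) Retry(key string) {
	d.queue.AddRateLimited(key)
}
```

A consumer goroutine would then Get items from the queue, perform the status update, and call Forget or Retry depending on the outcome before marking the item Done.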
Yes, I'm the same. One PR should be for one task, which this is. Since the operator was only a webhook prior to this, unfortunately it's necessary.
Summary
This PR closes #22, adding support for status updates within the ModelValidation. The status tracks a number of items.
The PR introduces controllers for tracking pods and modelvalidations, with a separate status tracker for coordinating the tracking and for updating status. Status updates for the ModelValidation resources are debounced, with backoff and retries.
This PR also includes modifications to the CRD printer columns to include the count, for example:
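A hedged example of what such a kubebuilder printer-column marker looks like; the column name and JSONPath here are illustrative, not necessarily the ones in the PR:

```go
// Placed above the ModelValidation type definition.
// +kubebuilder:printcolumn:name="Pods",type="integer",JSONPath=".status.injectedPodCount"
```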
Unit tests have been written, but we are still missing end to end tests and metrics. The end to end tests and metrics will be handled in separate PRs, as this one is already sizeable.
Release Note
Added status tracking for ModelValidation resources, to track pods which are associated with individual resources. The status of each resource contains
Documentation