fix: Skip redundant noop status updates in crp#1079
ahmetb wants to merge 2 commits into Azure:main
Compare the cached object's status to the newly calculated status and skip the status update if nothing has changed. When the cache catches up, we'll get another event in the workqueue anyway and can catch any drift then. Without this, there is a non-stop stream of crp/status updates when you have thousands of namespaces: each update takes time even though it is actually a noop, which causes workqueue buildup. Fixes Azure#940.
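The comparison described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the PR's actual code: `PlacementStatus`, `statusWriter`, and `updateStatusIfChanged` are hypothetical names, and `reflect.DeepEqual` stands in for the `apiequality.Semantic.DeepEqual` call used in the diff (the two do not behave identically for every Kubernetes type).

```go
package main

import (
	"fmt"
	"reflect"
)

// PlacementStatus is a hypothetical, simplified stand-in for the CRP status.
type PlacementStatus struct {
	ObservedGeneration int64
	SelectedClusters   []string
}

// statusWriter is a hypothetical stand-in for r.Client.Status().Update.
// It only counts how many writes were issued.
type statusWriter struct {
	writes int
}

func (w *statusWriter) Update(s PlacementStatus) error {
	w.writes++
	return nil
}

// updateStatusIfChanged issues a write only when the computed status differs
// from the cached one, and reports whether a write was issued.
func updateStatusIfChanged(w *statusWriter, cached, computed PlacementStatus) (bool, error) {
	if reflect.DeepEqual(cached, computed) {
		return false, nil // noop: skip the API call
	}
	if err := w.Update(computed); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	w := &statusWriter{}
	cached := PlacementStatus{ObservedGeneration: 3, SelectedClusters: []string{"member-1"}}

	same := PlacementStatus{ObservedGeneration: 3, SelectedClusters: []string{"member-1"}}
	wrote, _ := updateStatusIfChanged(w, cached, same)
	fmt.Println("unchanged status written:", wrote)

	changed := PlacementStatus{ObservedGeneration: 4, SelectedClusters: []string{"member-1", "member-2"}}
	wrote, _ = updateStatusIfChanged(w, cached, changed)
	fmt.Println("changed status written:", wrote)
	fmt.Println("total writes:", w.writes)
}
```

The saving comes from the early return: with thousands of objects whose status has not changed, no write is sent for any of them.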
The test failure does not seem to be related to this change; another PR has a similar failure.
Thank you for the POC. Unfortunately, the CRP controller is not a controller-runtime controller (we need a customized one to accommodate our special case of watching every object in the cluster), so the reconcile loop won't be triggered by a change to the CRP's status. We don't want to reconcile based on status either. With that said, we will brainstorm some ideas to avoid the update storm upon restart.
Yeah, this is not related to this PR. We have a PR to fix the flaky e2e test.
if err := r.Client.Status().Update(ctx, crp); err != nil {
    klog.ErrorS(err, "Failed to update the status", "clusterResourcePlacement", crpKObj)
    return ctrl.Result{}, err
if !apiequality.Semantic.DeepEqual(oldCRP.Status, crp.Status) {
Unfortunately, this is not safe, as we don't reconcile on status change.
I still don't quite understand:
- who else do you expect will update the status of this resource and make it out of sync?
- why can't a periodic resync that you already have course-correct drifts eventually?
- Only the CRP controller changes its status, but the reconcile is triggered by many other objects' status changes.
- If we rely on periodic resync to course-correct, the resync period needs to be reasonably short (say, a minute), which we agreed is not good practice.
We should still try to optimize the startup sequence though
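One possible reading of the safety concern is sketched below as a toy model; it is an interpretation of the thread, not code from Fleet. The `server`, `cache`, and `reconcile` names are hypothetical, and statuses are reduced to plain strings. The sketch assumes the cache lags behind a write the controller just made, and that a later reconcile (triggered by some other object) computes the old status again.

```go
package main

import "fmt"

// server is a hypothetical API server holding the authoritative status.
type server struct{ status string }

// cache is a hypothetical informer cache that can lag behind the server.
type cache struct{ status string }

// sync copies the server state into the cache, as a watch event would.
func (c *cache) sync(s *server) { c.status = s.status }

// reconcile writes the computed status to the server unless it equals the
// cached status. It returns true when a write was issued.
func reconcile(s *server, c *cache, computed string) bool {
	if c.status == computed {
		return false // skipped: the cache claims nothing changed
	}
	s.status = computed
	return true
}

func main() {
	s := &server{status: "Pending"}
	c := &cache{status: "Pending"}

	// First reconcile writes "Applied"; the cache has not observed it yet.
	reconcile(s, c, "Applied")

	// A second reconcile computes "Pending" again. The stale cache still
	// says "Pending", so the write is skipped.
	wrote := reconcile(s, c, "Pending")

	// The cache catches up. In this model nothing re-queues the object,
	// mirroring a controller that does not reconcile on status changes.
	c.sync(s)

	fmt.Println("second write issued:", wrote)
	fmt.Println("server status:", s.status, "desired status: Pending")
}
```

In this model the server keeps a status that no longer matches the computed one, and only a later event from another object or a periodic resync would repair it.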
Hi Ahmet (long time no see ✨)! I am closing this PR as Fleet has been accepted as a CNCF sandbox project, and per our agreement with CNCF we will be moving to a CNCF-hosted repo for future development; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If you have any questions or concerns, please let me know. Thanks 🙏
make reviewable to ensure this PR is ready for review.
How has this code been tested
N/A, relying on existing tests.
Special notes for your reviewer
Context at #940.