
fix: Skip redundant noop status updates in crp #1079

Closed

ahmetb wants to merge 2 commits into Azure:main from ahmetb:ahmet/status-skip

Conversation

@ahmetb commented Mar 13, 2025

Description of your changes

Compare cached object to calculated status and if nothing has changed, skip the status update. When the cache catches up, we'll get another event in the workqueue anyway and we can catch the drift.

Otherwise this is causing a non-stop stream of crp/status updates when you have thousands of namespaces and each takes time to update status even though the update was actually noop, which causes workqueue buildup.

Fixes #940.

  • Run make reviewable to ensure this PR is ready for review.

The linter seems to be broken on Go 1.23.5?

 ERRO Running error: 1 error occurred:
         * can't run linter goanalysis_metalinter: goanalysis_metalinter: buildir: package "slices" (isInitialPkg: false, needAnalyzeSource: true): Cannot range over: func(yield func(E) bool)
 ...
 ERRO [linters_context/goanalysis] SA5012: panic during analysis: interface conversion: interface {} is nil, not *buildir.IR, goroutine 12724 [running]:
 runtime/debug.Stack()
 ...

How has this code been tested

N/A; relying on existing tests.

Special notes for your reviewer

Context at #940.

@ArchanaAnand0212

The test failure seems not to be related to the change; there needs to be an err != nil check here:

[FAILED] Timed out after 10.001s.
  Failed to remove work resources from member cluster kind-cluster-1
  Expected success, but got an error:
      <*fmt.wrapError | 0xc000689020>: 
      work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)
      {
          msg: "work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)",
          err: nil,
      }

Other PR has similar failure.

@ryanzhang-oss (Contributor) commented Mar 14, 2025

Thank you for the POC. Unfortunately, the CRP controller is not a controller-runtime controller (we need a customized one in order to accommodate our special case to watch every object in the cluster) so the reconcile loop won't be triggered by its status change. We don't want to reconcile based on status either. With that said, we will brainstorm some ideas to avoid the update storm upon restart.


@ryanzhang-oss (Contributor)

The test failure seems not to be related to the change; there needs to be an err != nil check here:

[FAILED] Timed out after 10.001s.
  Failed to remove work resources from member cluster kind-cluster-1
  Expected success, but got an error:
      <*fmt.wrapError | 0xc000689020>: 
      work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)
      {
          msg: "work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)",
          err: nil,
      }

Other PR has similar failure.

Yeah, this is not related to this PR. We have a PR to fix the flaky e2e tests.

@ryanzhang-oss changed the title from "fix(crp): Skip redundant noop status updates" to "fix: Skip redundant noop status updates in crp" on Mar 14, 2025
if !apiequality.Semantic.DeepEqual(oldCRP.Status, crp.Status) {
	if err := r.Client.Status().Update(ctx, crp); err != nil {
		klog.ErrorS(err, "Failed to update the status", "clusterResourcePlacement", crpKObj)
		return ctrl.Result{}, err
	}
}
Contributor:

Unfortunately, this is not safe as we don't reconcile on status change.

Author:

I still don't quite understand:

  • who else do you expect will update the status of this resource and make it out of sync?
  • why can't a periodic resync that you already have course-correct drifts eventually?

Contributor:


  • Only the CRP controller changes its status, but the reconcile is triggered by many other objects' status changes.
  • If we rely on periodic resync to course-correct, then the resync period needs to be reasonably short (say, a minute), which is something we agreed is not good practice.

We should still try to optimize the startup sequence though

@michaelawyu (Contributor)

Hi Ahmet (Long time no see ✨)! I am closing this PR as Fleet has been accepted as a CNCF sandbox project, and per our agreement with CNCF we will be moving to a CNCF hosted repo for future development; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏



Development

Successfully merging this pull request may close these issues.

[BUG] hub-agent making noop /status updates on every reconciliation
