
fix: Skip redundant noop status updates in crp #1079

Closed

ahmetb wants to merge 2 commits into Azure:main from ahmetb:ahmet/status-skip

Conversation

@ahmetb commented Mar 13, 2025

Description of your changes

Compare cached object to calculated status and if nothing has changed, skip the status update. When the cache catches up, we'll get another event in the workqueue anyway and we can catch the drift.

Otherwise this is causing a non-stop stream of crp/status updates when you have thousands of namespaces and each takes time to update status even though the update was actually noop, which causes workqueue buildup.

Fixes #940.

  • Run make reviewable to ensure this PR is ready for review.

The linter seems to be broken on Go 1.23.5?

 ERRO Running error: 1 error occurred:
         * can't run linter goanalysis_metalinter: goanalysis_metalinter: buildir: package "slices" (isInitialPkg: false, needAnalyzeSource: true): Cannot range over: func(yield func(E) bool)
 ...
 ERRO [linters_context/goanalysis] SA5012: panic during analysis: interface conversion: interface {} is nil, not *buildir.IR, goroutine 12724 [running]:
 runtime/debug.Stack()
 ...

How has this code been tested

N/A; relying on existing tests.

Special notes for your reviewer

Context at #940.

@ArchanaAnand0212

The test failure seems not to be related to the change; there needs to be an err != nil check here:

[FAILED] Timed out after 10.001s.
  Failed to remove work resources from member cluster kind-cluster-1
  Expected success, but got an error:
      <*fmt.wrapError | 0xc000689020>: 
      work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)
      {
          msg: "work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)",
          err: nil,
      }

Other PR has similar failure.

@ryanzhang-oss (Contributor) commented Mar 14, 2025

Thank you for the POC. Unfortunately, the CRP controller is not a controller-runtime controller (we need a customized one in order to accommodate our special case to watch every object in the cluster) so the reconcile loop won't be triggered by its status change. We don't want to reconcile based on status either. With that said, we will brainstorm some ideas to avoid the update storm upon restart.


@ryanzhang-oss (Contributor)

The test failure seems not to be related to the change; there needs to be an err != nil check here:

[FAILED] Timed out after 10.001s.
  Failed to remove work resources from member cluster kind-cluster-1
  Expected success, but got an error:
      <*fmt.wrapError | 0xc000689020>: 
      work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)
      {
          msg: "work namespace application-3 still exists or an unexpected error occurred: %!w(<nil>)",
          err: nil,
      }

Other PR has similar failure.

Yeah, this is not related to this PR. We have a PR to fix the flaky e2e tests.

@ryanzhang-oss changed the title from "fix(crp): Skip redundant noop status updates" to "fix: Skip redundant noop status updates in crp" on Mar 14, 2025
if !apiequality.Semantic.DeepEqual(oldCRP.Status, crp.Status) {
	if err := r.Client.Status().Update(ctx, crp); err != nil {
		klog.ErrorS(err, "Failed to update the status", "clusterResourcePlacement", crpKObj)
		return ctrl.Result{}, err
	}
}
Contributor:

Unfortunately, this is not safe as we don't reconcile on status change.

Author:

I still don't quite understand:

  • who else do you expect will update the status of this resource and make it out of sync?
  • why can't a periodic resync that you already have course-correct drifts eventually?

Contributor:


  • Only the CRP controller changes its status, but the reconcile is triggered by many other objects' status changes.
  • If we rely on periodic resync to course-correct, then the resync period needs to be reasonably short (say, a minute), which is something we agreed is not good practice.

We should still try to optimize the startup sequence though

@michaelawyu (Contributor)

Hi Ahmet (Long time no see ✨)! I am closing this PR as Fleet has been accepted as a CNCF sandbox project, and per our agreement with CNCF we will be moving to a CNCF hosted repo for future development; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏



Development

Successfully merging this pull request may close these issues.

[BUG] hub-agent making noop /status updates on every reconciliation
