Conversation

@rbtr
Collaborator

@rbtr rbtr commented Feb 25, 2025

Instead of crashing after 10 failed attempts to initialize the CNS state, this change retries until initialization succeeds and increments a metric on each failed attempt to count NNC init failures. It also adds a positive-signal metric, "hasNNCInitialized", to indicate that initialization has completed successfully.

@rbtr rbtr added the cns (Related to CNS.), release/latest (Change affects latest release train), and needs-backport (Change needs to be backported to previous release trains) labels Feb 25, 2025
@rbtr rbtr self-assigned this Feb 25, 2025
Copilot AI review requested due to automatic review settings February 25, 2025 23:32
@rbtr rbtr requested a review from a team as a code owner February 25, 2025 23:32
Contributor

Copilot AI left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

cns/service/metrics.go:9

  • Typo in comment: 'monotic' should be 'monotonic'.
    // managerStartFailures is a monotic counter which tracks the number of times the controller-runtime

@rbtr
Collaborator Author

rbtr commented Feb 26, 2025

/azp run Azure Container Networking PR

@rbtr rbtr enabled auto-merge February 26, 2025 17:38
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr requested a review from Copilot February 28, 2025 06:14
Contributor

Copilot AI left a comment

PR Overview

This pull request updates the CNS initialization process to retry until successful, tracking failures and success via new metrics. Key changes include:

  • Replacing a finite retry count with an infinite (until succeeded) exponential backoff retrier in the main service.
  • Incrementing a failure metric (nncInitFailure) on each NNC init failure and setting a success gauge (hasNNCInitialized) after successful reconciliation.
  • Adding two new Prometheus metrics in metrics.go for NNC initialization tracking.

Reviewed Changes

| File | Description |
| --- | --- |
| cns/service/main.go | Updated retry logic and added metric instrumentation for CNS init state |
| cns/service/metrics.go | Added two new metrics (nncInitFailure and hasNNCInitialized) with registration |

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

cns/service/main.go:1462

  • The variable 'initCNSInitalDelay' appears to have a typo; consider renaming it to 'initCNSInitialDelay' for clarity.
}, retry.Context(ctx), retry.Delay(initCNSInitalDelay), retry.MaxDelay(time.Minute), retry.UntilSucceeded())

cns/service/metrics.go:29

  • The word 'monotic' in the comment appears to be a typo; consider changing it to 'monotonic'.
// nncInitFailure is a monotic counter which tracks the number of times the initial NNC reconcile has failed.

@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Mar 15, 2025
@github-actions

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Mar 23, 2025
auto-merge was automatically disabled March 23, 2025 00:01

Pull request was closed

@github-actions github-actions bot deleted the feat/cns-nnc-init-failures branch March 23, 2025 00:01
@rbtr rbtr restored the feat/cns-nnc-init-failures branch March 24, 2025 18:20
@rbtr rbtr reopened this Mar 24, 2025
@rbtr rbtr enabled auto-merge March 24, 2025 18:24
@rbtr rbtr removed the stale Stale due to inactivity. label Mar 24, 2025
@github-actions

github-actions bot commented Apr 8, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Apr 8, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Apr 10, 2025
@rbtr
Collaborator Author

rbtr commented Apr 10, 2025

/azp run Azure Container Networking PR

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Apr 25, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Apr 30, 2025
@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label May 16, 2025
@github-actions

Pull request closed due to inactivity.

@github-actions github-actions bot closed this May 23, 2025
auto-merge was automatically disabled May 23, 2025 00:01

Pull request was closed

@github-actions github-actions bot deleted the feat/cns-nnc-init-failures branch May 23, 2025 00:01
@rbtr rbtr added the exempt-stale (Keep this fresh) label and removed the stale (Stale due to inactivity.) label May 29, 2025
@rbtr rbtr restored the feat/cns-nnc-init-failures branch May 29, 2025 22:51
@rbtr rbtr reopened this May 29, 2025
@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jun 13, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Jun 13, 2025
@rbtr rbtr enabled auto-merge June 13, 2025 17:54
@rbtr rbtr added this pull request to the merge queue Jun 13, 2025
Merged via the queue into master with commit d61a128 Jun 13, 2025
31 of 32 checks passed
@rbtr rbtr deleted the feat/cns-nnc-init-failures branch June 13, 2025 21:04
sivakami-projects pushed a commit that referenced this pull request Oct 23, 2025

Labels

cns (Related to CNS.), exempt-stale (Keep this fresh), needs-backport (Change needs to be backported to previous release trains), release/latest (Change affects latest release train)

4 participants