Skip to content

Conversation

@rbtr
Copy link
Collaborator

@rbtr rbtr commented Mar 25, 2025

#3498 reverted two functional changes to the CNS init HNS interaction:

  1. always restart HNS after attempting to set the regkey
  2. retry restarting HNS if it hangs

It turns out that HNS hangs are common enough that they notably impact CNS on Windows becoming ready and are preventing that revert from passing tests and rolling out. Thus, it is not safe for CNS to block on HNS restarting - if it hangs, CNS never becomes ready, meaning the Node never becomes Ready.

This is a retake on (2) while keeping (1) reverted - we will only try to restart HNS if the regkey is not already set. But when we do try to restart it, we will retry this instead of hanging forever.

Copilot AI review requested due to automatic review settings March 25, 2025 00:40
@rbtr rbtr requested a review from a team as a code owner March 25, 2025 00:40
@rbtr rbtr self-assigned this Mar 25, 2025
@rbtr rbtr added the cns Related to CNS. label Mar 25, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the robustness of the CNS HNS restart mechanism on Windows by incorporating retry logic to prevent indefinite hangs. The changes include:

  • Adding retry logic via the retry-go package to stop the HNS service.
  • Introducing a new helper function (tryStopServiceFn) to encapsulate stopping logic.
  • Expanding test coverage in platform/os_windows_test.go to validate multiple scenarios for stopping the HNS service.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
platform/os_windows.go Added retry logic and refactored service stop/start functionality.
platform/os_windows_test.go Added tests for tryStopServiceFn to cover various service scenarios.

@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr enabled auto-merge March 25, 2025 16:11
@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 47f6af9 to f11b9a0 Compare March 25, 2025 16:14
ashvindeodhar
ashvindeodhar previously approved these changes Mar 25, 2025
@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr force-pushed the fix/retry-hns-restart branch from d93c6d9 to 0d0817d Compare March 25, 2025 18:43
@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 0d0817d to 0e72c6c Compare March 25, 2025 20:46
@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 0e72c6c to 9bab884 Compare March 25, 2025 21:30
@rbtr
Copy link
Collaborator Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses the issue of HNS hanging when restarting by introducing retry logic so that CNS does not block indefinitely waiting on HNS.

  • Use of the retry-go library to repeatedly attempt stopping and starting the HNS service.
  • Addition of helper functions (tryStopServiceFn and tryStartServiceFn) to encapsulate the retry logic.
  • Expansion of unit tests in os_windows_test.go to cover the new retry-based service operations.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
platform/os_windows.go Implements retry logic with new helper functions for stopping/starting HNS.
platform/os_windows_test.go Adds tests and updates imports to cover the new retry-based service operation code.

@rbtr rbtr added this pull request to the merge queue Mar 27, 2025
Merged via the queue into master with commit ef873ba Mar 27, 2025
92 checks passed
@rbtr rbtr deleted the fix/retry-hns-restart branch March 27, 2025 18:54
rbtr added a commit that referenced this pull request Apr 4, 2025
github-merge-queue bot pushed a commit that referenced this pull request May 2, 2025
sivakami-projects pushed a commit that referenced this pull request Oct 23, 2025
* fix: carefully retry restarting HNS if it hangs

Signed-off-by: Evan Baker <[email protected]>

* retry start, check stop pending

Signed-off-by: Evan Baker <[email protected]>

---------

Signed-off-by: Evan Baker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cns Related to CNS.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants