fix(NSC): harden Network Services Controller against panics, races, and sync errors #2041

Open
Aprazor wants to merge 2 commits into cloudnativelabs:master from Aprazor:fix/nsc-harden-panics-races-sync-errors

Aprazor commented Mar 23, 2026

What type of PR is this?

bug

What this PR does / why we need it:

Consolidates five defensive fixes in the Network Services Controller (per @aauren's feedback on #2020):

  1. shuffle() nil panic: rand.Int returns (nil, err) on failure, but the result was dereferenced before the error check
  2. NodePort healthcheck data race: UpdateServicesInfo writes shared maps from the sync goroutine while HTTP handlers read concurrently — added sync.RWMutex
  3. dual-stack firewall chain: return nil after clearing one IP family skipped the second family — changed to continue
  4. mangle table nil panic: net.ParseIP result used without nil check in both setupMangleTableRule and cleanupMangleTableRule
  5. heartbeat after partial sync failure: err from syncIpvsServices was overwritten by syncHairpinIptablesRules, masking IPVS failures from health checks
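
Fixes 1 and 4 share one pattern: check for failure before using the result. The following is a minimal sketch of that pattern, not the controller's actual code. It assumes rand.Int refers to crypto/rand (the math/rand variant has no error return); shuffleStrings and parseServiceIP are hypothetical names used only for illustration.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"net"
)

// shuffleStrings is a hypothetical stand-in for the controller's shuffle
// helper. It does an in-place Fisher-Yates shuffle and checks the error
// from rand.Int before touching the returned *big.Int.
func shuffleStrings(items []string) error {
	for i := len(items) - 1; i > 0; i-- {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(i+1)))
		if err != nil {
			// n is nil on this path, so it must not be dereferenced.
			return fmt.Errorf("generating random index: %w", err)
		}
		j := int(n.Int64())
		items[i], items[j] = items[j], items[i]
	}
	return nil
}

// parseServiceIP is a hypothetical wrapper around net.ParseIP, which
// returns nil for malformed input instead of an error.
func parseServiceIP(s string) (net.IP, error) {
	ip := net.ParseIP(s)
	if ip == nil {
		return nil, fmt.Errorf("invalid IP address %q", s)
	}
	return ip, nil
}

func main() {
	endpoints := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	if err := shuffleStrings(endpoints); err != nil {
		fmt.Println("shuffle failed:", err)
		return
	}
	fmt.Println("shuffled:", endpoints)

	if _, err := parseServiceIP("not-an-ip"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Returning an error from both helpers lets the caller decide whether to skip the service or abort the sync, instead of panicking inside the controller loop.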

Supersedes: #2020, #2021, #2023, #2036, #2037

Was AI used during the creation of this PR?

  • What tool was used: Claude Code
  • To what extent was the tool used? Code review identified the bugs, human reviewed and confirmed each fix
  • If drafted, how detailed of a plan did you create for the AI? Detailed — each bug was traced line-by-line and verified before fixing
  • Help us understand if a human was in the loop or not for this PR? Yes — human confirmed all findings, reviewed diffs, and approved before submission

What, if any, amount of integration testing was done with this change in a Kubernetes environment?

Unit tests pass (make test-pretty for proxy package). No integration testing.

Does this PR introduce a breaking change?

NONE

Anything else the reviewer should know that wasn't already covered?

This is a consolidation of the 5 NSC-related PRs per @aauren's request. The individual PRs will be closed once this is reviewed.

…nd sync errors

This combines five defensive fixes in the Network Services Controller:

1. shuffle(): check rand.Int error before dereferencing result
   - rand.Int returns (nil, err) on failure, but the result was
     dereferenced before the error check, causing a nil panic

2. NodePort healthcheck: add RWMutex to protect shared maps
   - UpdateServicesInfo writes serviceInfoMap/endpointsInfoMap from
     the sync goroutine while HTTP handlers read concurrently

3. setupIpvsFirewall: use continue instead of return in dual-stack loop
   - return nil after clearing one IP family's chain skipped the
     second family entirely on dual-stack nodes

4. setupMangleTableRule/cleanupMangleTableRule: add nil check for ParseIP
   - net.ParseIP result was used without nil check, causing panic
     on malformed IP strings from service annotations

5. synctypeIpvs: track errors across both sync steps for heartbeat
   - err from syncIpvsServices was overwritten by syncHairpinIptablesRules,
     masking IPVS failures from the health check system
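
The two control-flow fixes (items 3 and 5) can be sketched as follows. This is an illustrative model under stated assumptions, not the code in this PR: setupFirewall, syncBoth, and errExample are hypothetical names, and the real functions operate on iptables handlers rather than string slices.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample is a hypothetical failure used by the demo in main.
var errExample = errors.New("example IPVS failure")

// setupFirewall models a per-family loop (hypothetical). Every family has
// its chain cleared; rules are only configured when permitAll is set.
func setupFirewall(families []string, permitAll bool) (cleared, configured []string) {
	for _, family := range families {
		cleared = append(cleared, family)
		if !permitAll {
			// Nothing more to do for this family. A "return" here would
			// leave the remaining families unprocessed.
			continue
		}
		configured = append(configured, family)
	}
	return cleared, configured
}

// syncBoth runs both sync steps and keeps every error, so a failure in the
// first step is not overwritten by a successful second step.
func syncBoth(syncServices, syncHairpin func() error) error {
	var errs []error
	if err := syncServices(); err != nil {
		errs = append(errs, fmt.Errorf("syncing IPVS services: %w", err))
	}
	if err := syncHairpin(); err != nil {
		errs = append(errs, fmt.Errorf("syncing hairpin rules: %w", err))
	}
	// errors.Join returns nil when there is nothing to join.
	return errors.Join(errs...)
}

func main() {
	cleared, configured := setupFirewall([]string{"IPv4", "IPv6"}, false)
	fmt.Println("cleared:", cleared, "configured:", configured)

	err := syncBoth(
		func() error { return errExample },
		func() error { return nil },
	)
	if err != nil {
		fmt.Println("skipping heartbeat:", err)
		return
	}
	fmt.Println("sending heartbeat")
}
```

errors.Join (Go 1.20+) is one way to carry both failures; tracking a single boolean or the first error would serve the heartbeat decision equally well.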

…d IP handling

Table-driven tests following project conventions (testify assertions,
subtests) covering:
- shuffle: empty, single, and multi-element slices don't panic
- NodePort healthcheck: concurrent read/write with RWMutex is safe
- ParseIP: invalid IPs correctly return nil
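
For reference, the locking pattern behind the NodePort healthcheck fix might look like the sketch below. It is a simplified model, assuming a single map guarded by one sync.RWMutex; healthInfo, update, and activeEndpoints are hypothetical names, not identifiers from the controller.

```go
package main

import (
	"fmt"
	"sync"
)

// healthInfo is a hypothetical stand-in for the healthcheck state shared
// between the sync goroutine (writer) and HTTP handlers (readers).
type healthInfo struct {
	mu        sync.RWMutex
	endpoints map[string]int // service name -> active endpoint count
}

// update replaces the map under the write lock.
func (h *healthInfo) update(m map[string]int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.endpoints = m
}

// activeEndpoints reads under the read lock, so many handlers can read
// concurrently while writers are excluded.
func (h *healthInfo) activeEndpoints(svc string) (int, bool) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	n, ok := h.endpoints[svc]
	return n, ok
}

func main() {
	h := &healthInfo{endpoints: map[string]int{}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			h.update(map[string]int{"default/web": i})
		}(i)
		go func() {
			defer wg.Done()
			h.activeEndpoints("default/web")
		}()
	}
	wg.Wait()

	n, ok := h.activeEndpoints("default/web")
	fmt.Println("active endpoints:", n, ok)
}
```

Running a sketch like this with go run -race (or the unit tests with go test -race) is what surfaces the unsynchronized access when the mutex is removed.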