fix(NSC): harden Network Services Controller against panics, races, and sync errors #2041
Open
Aprazor wants to merge 2 commits into cloudnativelabs:master from
…nd sync errors
This combines five defensive fixes in the Network Services Controller:

1. shuffle(): check rand.Int error before dereferencing result
   - rand.Int returns (nil, err) on failure, but the result was
     dereferenced before the error check, causing a nil panic
2. NodePort healthcheck: add RWMutex to protect shared maps
   - UpdateServicesInfo writes serviceInfoMap/endpointsInfoMap from
     the sync goroutine while HTTP handlers read concurrently
3. setupIpvsFirewall: use continue instead of return in dual-stack loop
   - return nil after clearing one IP family's chain skipped the
     second family entirely on dual-stack nodes
4. setupMangleTableRule/cleanupMangleTableRule: add nil check for ParseIP
   - net.ParseIP result was used without nil check, causing panic
     on malformed IP strings from service annotations
5. synctypeIpvs: track errors across both sync steps for heartbeat
   - err from syncIpvsServices was overwritten by syncHairpinIptablesRules,
     masking IPVS failures from the health check system
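
For readers unfamiliar with the first fix, the pattern can be sketched as follows. This is a minimal illustration, not the code from this PR: the `shuffle` signature, the use of `crypto/rand` with a Fisher-Yates pass, and the `[]string` element type are all assumptions. The only point it demonstrates is that the error from `rand.Int` is checked before the returned `*big.Int` is dereferenced.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// shuffle returns a shuffled copy of in (hypothetical signature).
// rand.Int returns a nil *big.Int together with a non-nil error on failure,
// so the error is checked before n is used.
func shuffle(in []string) ([]string, error) {
	out := make([]string, len(in))
	copy(out, in)
	// Fisher-Yates: for empty and single-element slices the loop never runs.
	for i := len(out) - 1; i > 0; i-- {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(i+1)))
		if err != nil {
			return nil, fmt.Errorf("generating random index: %w", err)
		}
		j := int(n.Int64()) // safe: n is non-nil once err has been ruled out
		out[i], out[j] = out[j], out[i]
	}
	return out, nil
}

func main() {
	for _, in := range [][]string{nil, {"a"}, {"a", "b", "c"}} {
		out, err := shuffle(in)
		if err != nil {
			fmt.Println("shuffle failed:", err)
			continue
		}
		fmt.Println(len(in), "->", len(out), "elements")
	}
}
```

Returning the error to the caller is one option; falling back to the unshuffled input would be another, and the PR text does not say which was chosen.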
This was referenced Mar 23, 2026
…d IP handling
Table-driven tests following project conventions (testify assertions, subtests) covering:
- shuffle: empty, single, and multi-element slices don't panic
- NodePort healthcheck: concurrent read/write with RWMutex is safe
- ParseIP: invalid IPs correctly return nil
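
The two behaviours these tests exercise (locked access to shared healthcheck state, and nil handling for `net.ParseIP`) might look roughly like the sketch below. The `healthStore` type, its methods, and the `parseIP` helper are hypothetical names invented for illustration; only `sync.RWMutex` and `net.ParseIP` come from the standard library, and the real kube-router structures are not reproduced here.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// healthStore is a hypothetical stand-in for the NodePort healthcheck state.
type healthStore struct {
	mu        sync.RWMutex
	endpoints map[string]int // service name -> ready endpoint count
}

// update swaps in a new map under the write lock (the sync goroutine's role).
func (s *healthStore) update(m map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.endpoints = m
}

// ready reads under the read lock (the HTTP handler's role).
func (s *healthStore) ready(name string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.endpoints[name]
}

// parseIP turns the nil result of net.ParseIP into an explicit error so that
// callers cannot use a nil net.IP by accident.
func parseIP(s string) (net.IP, error) {
	ip := net.ParseIP(s)
	if ip == nil {
		return nil, fmt.Errorf("invalid IP address %q", s)
	}
	return ip, nil
}

func main() {
	store := &healthStore{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func(n int) { // writer
			defer wg.Done()
			store.update(map[string]int{"web": n})
		}(i)
		go func() { // reader
			defer wg.Done()
			_ = store.ready("web")
		}()
	}
	wg.Wait()

	for _, s := range []string{"10.0.0.1", "fd00::1", "not-an-ip"} {
		if _, err := parseIP(s); err != nil {
			fmt.Println("rejected:", err)
		}
	}
}
```

Running a program or test like this with the `-race` flag is the usual way to confirm that the locking actually covers every access.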
What type of PR is this?
bug
What this PR does / why we need it:
Consolidates five defensive fixes in the Network Services Controller (per @aauren's feedback on #2020):
- rand.Int returns (nil, err) on failure, but the result was dereferenced before the error check
- UpdateServicesInfo writes shared maps from the sync goroutine while HTTP handlers read concurrently; added sync.RWMutex
- return nil after clearing one IP family skipped the second family; changed to continue
- net.ParseIP result used without nil check in both setupMangleTableRule and cleanupMangleTableRule
- err from syncIpvsServices was overwritten by syncHairpinIptablesRules, masking IPVS failures from health checks

Supersedes: #2020, #2021, #2023, #2036, #2037
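
To make the dual-stack loop fix and the error-tracking fix concrete, here is a rough sketch under stated assumptions. The function names `setupFirewall` and `syncAll`, the per-family condition, and the "keep the first error" policy are all hypothetical; the PR description only says that `return nil` became `continue` and that the error from the first sync step is no longer overwritten by the second.

```go
package main

import (
	"errors"
	"fmt"
)

// errIpvs is a placeholder failure used only by this example.
var errIpvs = errors.New("ipvs sync failed")

// setupFirewall walks every IP family. A family that only needs its chain
// cleared is handled and then skipped with continue; a return at that point
// would leave the remaining families untouched on a dual-stack node.
func setupFirewall(families []string, hasServices map[string]bool) (cleared, configured []string) {
	for _, fam := range families {
		if !hasServices[fam] {
			cleared = append(cleared, fam)
			continue // not "return nil": the next family must still be visited
		}
		configured = append(configured, fam)
	}
	return cleared, configured
}

// syncAll runs every step and remembers the first failure instead of letting
// a later, successful step overwrite it.
func syncAll(steps ...func() error) error {
	var firstErr error
	for _, step := range steps {
		if err := step(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

func main() {
	cleared, configured := setupFirewall(
		[]string{"ipv4", "ipv6"},
		map[string]bool{"ipv6": true},
	)
	fmt.Println("cleared:", cleared, "configured:", configured)

	err := syncAll(
		func() error { return errIpvs }, // stands in for the IPVS sync step
		func() error { return nil },     // stands in for the hairpin rules step
	)
	fmt.Println("report healthy heartbeat:", err == nil)
}
```

With the first error preserved, a caller can decide to send the heartbeat only when the combined result is nil, which is the behaviour the fifth fix is aiming for.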
Was AI used during the creation of this PR?
What, if any, amount of integration testing was done with this change in a Kubernetes environment?
Unit tests pass (make test-pretty for proxy package). No integration testing.
Does this PR introduce a breaking change?
Anything else the reviewer should know that wasn't already covered?
This is a consolidation of the 5 NSC-related PRs per @aauren's request. The individual PRs will be closed once this is reviewed.