Skip to content

Conversation

@mtulio
Copy link
Contributor

@mtulio mtulio commented Jul 15, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR propose fix on leaked security group (SG) when a Service type-loadBalancer (CLB) is updated adding the BYO SG annotation (service.beta.kubernetes.io/aws-load-balancer-security-groups), which replaces all SG added to the Load Balancer without removing linked rules, as well not deleting managed SG (created by controller).

Which issue(s) this PR fixes:

Fixes #1208

Special notes for your reviewer:

We decided of creating isolated dedicated methods to discover and remove linked rule's SG targeting to:

  • enhance code maintenance
  • enhance unit tests
  • allow to reuse the logic when NLB with SG is supported (future)

The unit tests and documentation(function) comments have been assisted by Cursor AI(model claude-4-sonet): AIA HAb SeCeNc Hin R v1.0

Does this PR introduce a user-facing change?:

Fixed security group leak when updating Classic Load Balancer (CLB) services with `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation. Controller-managed security groups are now properly cleaned up when switching CLB from managed security groups to user-specified security groups.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 15, 2025
@k8s-ci-robot k8s-ci-robot requested review from hakman and kishorj July 15, 2025 17:23
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 15, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 15, 2025
@mtulio mtulio changed the title Fix 1208 byosg update fix leak managed/owned security group on Service update with BYO SG Jul 15, 2025
@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 03f9775 to 83c92f2 Compare July 15, 2025 20:41
@elmiko
Copy link
Contributor

elmiko commented Jul 15, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2025
@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 83c92f2 to 23ba0b3 Compare July 15, 2025 21:11
@mtulio
Copy link
Contributor Author

mtulio commented Jul 15, 2025

/test all

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 23ba0b3 to 0fec46d Compare July 15, 2025 21:55
@mtulio
Copy link
Contributor Author

mtulio commented Jul 15, 2025

Fixing doc strings and failed unit tests from previous unexpected behavior:

/test all

@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 16, 2025
@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 0fec46d to 1907542 Compare July 16, 2025 03:34
@mtulio
Copy link
Contributor Author

mtulio commented Jul 16, 2025

/test pull-cloud-provider-aws-e2e-kubetest2

@mtulio
Copy link
Contributor Author

mtulio commented Jul 16, 2025

/test all

@mtulio
Copy link
Contributor Author

mtulio commented Jul 16, 2025

I can't find connection between failures in pull-cloud-provider-aws-e2e-kubetest2 and existing changes.

I am going to convert to regular PR to ask for reviewers while we observe if this isnt a CI flake.

PTAL?
/assign @kmala @elmiko @JoelSpeed

@mtulio
Copy link
Contributor Author

mtulio commented Oct 2, 2025

FWIW interim update, this PR is still alive and need to be fixed, and proposal could be used in the logic of BYOSG in NLBs. I am planning to return on it next week to rebase and ask for final review with recent updates in the Service NLB and e2e.

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from f0b38b6 to e236025 Compare October 21, 2025 13:39
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from elmiko. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mtulio
Copy link
Contributor Author

mtulio commented Oct 21, 2025

PR rebased including/merging managed NLB SG, ensuring readiness with new e2e tests validating the fix:

/test pull-cloud-provider-aws-e2e

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from e236025 to 3cf5c63 Compare October 21, 2025 14:16
@mtulio
Copy link
Contributor Author

mtulio commented Oct 21, 2025

Fixed conflicts
/test pull-cloud-provider-aws-e2e

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 3cf5c63 to e98e826 Compare October 21, 2025 19:31
@mtulio
Copy link
Contributor Author

mtulio commented Oct 21, 2025

small fixes in the e2e for PSP and logging:

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Oct 21, 2025

e2e is now passing, I am going to polish the e2e. Let’s see other overall:

/test all

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from e98e826 to ca1ef98 Compare December 23, 2025 19:51
@mtulio
Copy link
Contributor Author

mtulio commented Dec 23, 2025

Code Review - Claude Assistant (claude-sonnet-4.5@20250929)


Summary

This PR addresses security group leakage (#1208) when updating Classic Load Balancer services from controller-managed to user-specified security groups via the service.beta.kubernetes.io/aws-load-balancer-security-groups annotation.

Core Change: Introduces lifecycle management for controller-owned security groups during load balancer updates, ensuring proper cleanup when security groups are replaced.


Problem Analysis

Current Behavior:
When the BYO security group annotation is added to an existing Service, the controller applies new security groups but leaves orphaned managed security groups in AWS.

Impact:

  • Resource leaks consuming AWS quotas
  • Potential security exposure from unmanaged ingress rules
  • Manual cleanup burden on operators

Root Cause:
The ensureLoadBalancer function in aws_loadbalancer.go replaces security groups via ApplySecurityGroupsToLoadBalancer but lacks cleanup logic for the replaced security groups.


Solution Design

New Functions

1. buildSecurityGroupRuleReferences()

Purpose: Identifies security groups with ingress rules referencing a target security group.

Returns:

  • Map of all referencing security groups → cluster tag status
  • Map of cluster-tagged security groups → their relevant IP permissions

Usage: Called before security group deletion to identify and revoke dependencies.

Code Reuse: Replaces duplicated logic in updateInstanceSecurityGroupsForLoadBalancer (aws.go:2940).

2. removeOwnedSecurityGroups()

Purpose: Safely removes controller-owned security groups after validation and dependency cleanup.

Algorithm:

  1. Validate ownership via cluster tags (kubernetes.io/cluster/<name>: owned)
  2. Query dependent security groups
  3. Revoke ingress rules from cluster-tagged dependents
  4. Delete owned security groups via deleteSecurityGroupsWithBackoff

Error Handling: Collects all errors during processing, continues with remaining security groups.

Integration Point

Modified ensureLoadBalancer() to call removeOwnedSecurityGroups() after ApplySecurityGroupsToLoadBalancer successfully updates security groups.


Code Quality

Strengths

  1. Separation of Concerns: Clean functional decomposition with single-responsibility functions
  2. Error Handling: Proper error wrapping using %w for error chain preservation
  3. Defensive Programming: Validates ownership and cluster tags before destructive operations
  4. Test Coverage: 428 lines of unit tests covering owned/non-owned scenarios, errors, and edge cases
  5. E2E Testing: Comprehensive scenario testing with resource cleanup validation
  6. Documentation: Clear GoDoc with parameter and return value descriptions

Code Review Findings

Minor Improvements Implemented

  1. Error Wrapping Consistency (aws_loadbalancer.go:1906)

    • Changed %q to %w for proper error unwrapping support
    • Enables use of errors.Is() and errors.As() for error inspection
  2. Logging Enhancement (aws_loadbalancer.go:1992)

    • Added deletion count to log message: "deleted %d loadbalancer owned security groups"
    • Improves debugging and operational visibility

Retry Logic Analysis

Question: Should retry logic be added for RevokeSecurityGroupIngress?

Answer: No. The AWS SDK v2 for Go includes automatic retry logic with exponential backoff for:

  • Throttling errors (429)
  • Transient network errors
  • 5xx server errors
  • Timeout errors

Evidence: EC2 client initialized via ec2.NewFromConfig(cfg) (aws_sdk.go) inherits default retry configuration from the SDK, which implements the standard retry strategy.

Current Behavior: If RevokeSecurityGroupIngress fails due to transient errors, the SDK retries automatically. If it permanently fails, the error is collected but doesn't block security group deletion. The deleteSecurityGroupsWithBackoff function handles dependency violations during deletion with its own retry mechanism (10-minute timeout, 5-second intervals).

Conclusion: Adding manual retry logic would be redundant and potentially interfere with SDK behavior.


Testing Strategy

Unit Tests (428 lines)

TestCloud_removeOwnedSecurityGroups:

  • Successfully remove owned security groups
  • Skip non-owned security groups
  • Handle ownership validation errors
  • Handle empty security group lists

TestCloud_buildSecurityGroupRuleReferences:

  • Successfully build references with cluster-tagged groups
  • Handle non-cluster-tagged groups
  • Handle API errors
  • Handle no results

E2E Tests (860 net lines)

New Test Scenario: "CLB with managed Security Group must update to BYO Security Group"

Test Flow:

  1. Create Service with default (managed) security groups
  2. Update Service annotation with BYO security group
  3. Verify BYO security group is applied to load balancer
  4. Verify managed security groups are deleted from AWS
  5. Cleanup test resources

Test Infrastructure: New aws_helper.go provides utilities for:

  • Cluster tag discovery
  • Security group management
  • Load balancer inspection
  • Rule verification

Logging Strategy

Production Code:

  • klog.V(2).Infof() for internal operations (revoke rules, deletion steps)
  • klog.Warningf() for operational decisions (skipping non-owned SGs)

Rationale: Follows Kubernetes logging conventions where V(2) is the recommended minimum for troubleshooting. Operators run with --v=1 by default for clean logs, increasing to --v=2 when debugging security group issues.

E2E Tests:

  • Reduced from ~100 to ~71 log statements (29% reduction)
  • Retained: phase transitions, errors, important state verification
  • Removed: routine operation confirmations, iteration details

Security Considerations

  1. Ownership Validation: Checks cluster tag before deletion (kubernetes.io/cluster/<name>: owned)
  2. Dependency Resolution: Revokes ingress rules before deletion to prevent dependency violations
  3. Selective Deletion: Only deletes controller-owned resources, warns about user-managed resources
  4. Error Boundaries: Failures in one security group don't block processing of others

Potential Concerns

1. Race Conditions

Scenario: Security group deleted between buildSecurityGroupRuleReferences and RevokeSecurityGroupIngress calls.

Mitigation: Errors are collected but don't fail the operation. The deleteSecurityGroupsWithBackoff retry logic handles timing issues.

2. Partial Failures

Scenario: Some security groups fail to delete due to permissions or AWS limits.

Behavior: Errors are returned to caller, logged, but Service update succeeds. Security groups remain for manual investigation.

Alternative Consideration: Could return error to retry entire operation, but current design prioritizes Service availability over cleanup completeness.


Recommendations

Approve with Observations

Recommendation: Approve and merge

Confidence: High - code quality is strong, testing is comprehensive, and the approach is sound.

Observations for Future Work:

  1. Metrics: Consider adding Prometheus metrics for security group cleanup operations (success/failure counts, duration)

  2. Documentation: Add troubleshooting section to docs for security group cleanup failures

  3. Async Cleanup: For production systems with many security groups, consider whether cleanup should be asynchronous to avoid blocking Service updates


Technical Verification

  • Code compiles successfully
  • Unit tests added with good coverage
  • E2E tests validate full scenario
  • No breaking API changes
  • Error handling follows Go best practices
  • Logging follows Kubernetes conventions
  • AWS SDK best practices followed (automatic retries, error inspection)

Reviewed by: Claude Code Assistant (claude-sonnet-4.5@20250929)
Date: 2025-12-23
Recommendation: APPROVE

@mtulio mtulio force-pushed the fix-1208-byosg-update branch from ca1ef98 to 63e7396 Compare December 23, 2025 20:57
@mtulio mtulio marked this pull request as ready for review December 23, 2025 20:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 23, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mmerkes December 23, 2025 20:57
@mtulio
Copy link
Contributor Author

mtulio commented Dec 23, 2025

initial investigation is pointing to flake test, checking:

/test pull-cloud-provider-aws-test

mtulio added a commit to mtulio/openshift-cloud-provider-aws that referenced this pull request Dec 24, 2025
This commit fixes a security group leak vulnerability in NLB (Network Load
Balancer) when updating a service from controller-managed security groups
to BYO (Bring Your Own) security groups.

Problem:
When updating an NLB service to use BYO security groups via the
service.beta.kubernetes.io/aws-load-balancer-security-groups annotation,
the old managed security groups were not being deleted from AWS, causing
resource leaks.

Solution:
1. Added security group drift detection in ensureLoadBalancerv2()
   - Compares expected vs actual SGs on every reconciliation
   - Calls AWS SetSecurityGroups API when drift detected
   - Triggers cleanup of replaced managed security groups

2. Added helper function buildSecurityGroupRuleReferences()
   - Finds security groups with ingress rules referencing target SG
   - Identifies cluster-owned vs external security groups
   - Returns permissions that need revocation before deletion

3. Added helper function removeOwnedSecurityGroups()
   - Verifies SG ownership via cluster tags
   - Revokes dependent ingress rules from other SGs
   - Deletes owned SGs with exponential backoff retry

Testing:
- Added comprehensive unit tests (TestEnsureLoadBalancerv2_SecurityGroupUpdate)
- Added E2E test for managed→BYO SG update scenario
- All existing tests pass

This fix follows the same pattern used for CLB in PR kubernetes#1209.

Fixes: Security group leak when updating NLB from managed to BYO SG
Related: kubernetes#1209 (CLB security group leak fix)
mtulio added a commit to mtulio/openshift-cloud-provider-aws that referenced this pull request Jan 5, 2026
@mtulio
Copy link
Contributor Author

mtulio commented Jan 8, 2026

Introduced test is passing:

[cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group

Failed tests, both hairpinning traffic, are failing to resolve DNS:

DNS resolution failure - check if target hostname is resolvable

Checking if this would be transient ('cloudability' issues) or related to e2e updates (which is mostly debug and new e2e).

@mtulio
Copy link
Contributor Author

mtulio commented Jan 8, 2026

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jan 8, 2026

The last attempt only the NLB test has failed for same reason as before (timeout):

[It] [cloud-provider-aws-e2e] loadbalancer NLB internal should be reachable with hairpinning traffic

I wonder if I need to consider increasing the timeout, although I think it is already high.

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jan 8, 2026

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cloud-provider-aws-e2e 63e7396 link true /test pull-cloud-provider-aws-e2e
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

The hairpinning trafic test(s) keeps flaking due timeout (to have a LB/resolve it's name). I had a good sync with @elmiko today, one approach would be increasing timeout of hairpinning traffic tests. I am also considering isolating the e2e improvements added on this PR to a dedicated one, so we can focus here in the fix part of things, while we investigate /isolate the e2e improvements. Open for thoughts.

Fix the managed (controller-owned) security group leak when
user provided security group is added to an existing
Service type-loadBalancer CLB.

fix/byosg/tests: unit tests to handle managed SG removal on BYOSG.

Introduce unit tests for functions added to validate Service update
to BYO Security Group annotations from a managed SG state.
Introduce BYO Security Group(SG) update scenario to Service CLB to validate
SG leak when user has created a Service CLB with default SG and
eventually updated to a user-provided.

kubernetes#1208
@mtulio mtulio force-pushed the fix-1208-byosg-update branch from 63e7396 to 3e497b2 Compare January 12, 2026 14:28
@mtulio
Copy link
Contributor Author

mtulio commented Jan 12, 2026

e2e timeout increased on loadbalancer curller/pooler to validate if CI issues are related. Expected to decrease the flake between hairpin traffic tests (CLB and NLB)

@mtulio
Copy link
Contributor Author

mtulio commented Jan 16, 2026

/test pull-cloud-provider-aws-e2e

@k8s-ci-robot
Copy link
Contributor

@mtulio: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cloud-provider-aws-test 3e497b2 link true /test pull-cloud-provider-aws-test
pull-cloud-provider-aws-e2e 3e497b2 link true /test pull-cloud-provider-aws-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Managed security group leak after annotation service.beta.kubernetes.io/aws-load-balancer-security-groups added to existing service

5 participants