fix(aws): surface NLB errors instead of swallowing them#645

Merged
ArangoGutierrez merged 1 commit into NVIDIA:main from ArangoGutierrez:fix/nlb-error-handling
Feb 13, 2026
@ArangoGutierrez (Collaborator)

Summary

  • DeregisterTargets error in pkg/provider/aws/nlb.go:332 was fully discarded (_, _ =). Now logs it as a warning so failures are diagnosable.
  • deleteNLBForCluster error in pkg/provider/aws/delete.go:104 was only warned about. Now returns the error so callers know about leaked AWS resources and can take action.

Addresses audit findings #9 (MEDIUM) and #10 (MEDIUM).
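
The first change can be illustrated with a minimal sketch. It is not the code from pkg/provider/aws/nlb.go: the logger type, the deregisterTargets helper, the warning text, and the error message are hypothetical stand-ins, and only the pattern (log the error instead of assigning it to `_`) comes from the summary above.

```go
package main

import (
	"errors"
	"fmt"
)

// logger is a hypothetical stand-in for the provider's logger.
type logger struct{ warnings []string }

// Warning records a formatted warning so the caller can inspect it later.
func (l *logger) Warning(format string, args ...any) {
	l.warnings = append(l.warnings, fmt.Sprintf(format, args...))
}

// deregisterTargets is a hypothetical stand-in for the ELBv2
// DeregisterTargets call; it fails on demand.
func deregisterTargets(fail bool) error {
	if fail {
		return errors.New("target group not found")
	}
	return nil
}

// cleanupTargets shows the "after" pattern: the error is no longer
// discarded with `_, _ =`; it is logged and cleanup carries on.
func cleanupTargets(log *logger, fail bool) {
	if err := deregisterTargets(fail); err != nil {
		log.Warning("Error deregistering targets: %v", err)
	}
}

func main() {
	log := &logger{}
	cleanupTargets(log, true)
	fmt.Println(log.warnings)
}
```

The failure stays non-fatal here, which matches the description: deregistration problems become visible in the logs without aborting target group deletion.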

Test plan

  • gofmt — no formatting issues
  • golangci-lint run ./... — 0 issues
  • go test ./pkg/... — all tests pass
  • go build -o bin/holodeck cmd/cli/main.go — compiles
  • go mod tidy && go mod verify — all modules verified

DeregisterTargets error was fully discarded. Log it as a warning.
deleteNLBForCluster error was only warned — return it so callers
know about leaked AWS resources.

Audit findings NVIDIA#9 (MEDIUM), NVIDIA#10 (MEDIUM).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 12, 2026 19:42
Copilot AI left a comment

Pull request overview

This PR improves diagnosability and correctness of AWS NLB teardown in Holodeck by ensuring ELBv2 failures aren’t silently ignored during deletion, aligning with the audit findings referenced in the PR description.

Changes:

  • Log a warning when DeregisterTargets fails instead of discarding the error.
  • Propagate deleteNLBForCluster failures to the caller instead of only warning.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pkg/provider/aws/nlb.go | Surfaces ELBv2 deregistration failures via warning logs during target group deletion. |
| pkg/provider/aws/delete.go | Changes delete flow to return NLB deletion errors to the caller for visibility. |

Comment on lines 104 to 106
if err := p.deleteNLBForCluster(clusterCache); err != nil {
	p.log.Warning("Error deleting load balancer: %v", err)
	return fmt.Errorf("failed to delete load balancer (resources may be leaked): %w", err)
}
Copilot AI Feb 12, 2026

Returning immediately on load balancer deletion failure stops the rest of the teardown (EC2 instances, security groups, VPC). This can leak more resources than just the NLB. Consider continuing with phases 1–3 and returning an aggregated error at the end (or track the NLB error and return it after other cleanup completes).
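
One way to read this suggestion is sketched below. It is an assumption-laden illustration, not Holodeck code: the phase type, the teardown function, the phase names, and the errNLB message are all hypothetical. The only real API it relies on is errors.Join from the Go standard library (Go 1.20+), which returns nil when there is nothing to join.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is a hypothetical named cleanup step.
type phase struct {
	name string
	run  func() error
}

// teardown runs every phase, even when an earlier one fails, and
// returns all failures joined into a single error.
func teardown(phases []phase) error {
	var errs []error
	for _, p := range phases {
		if err := p.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p.name, err))
		}
	}
	return errors.Join(errs...) // nil when no phase failed
}

// errNLB is a hypothetical failure used for the demonstration.
var errNLB = errors.New("load balancer deletion failed")

func main() {
	var ran []string
	step := func(name string, err error) phase {
		return phase{name: name, run: func() error {
			ran = append(ran, name)
			return err
		}}
	}

	err := teardown([]phase{
		step("load balancer", errNLB),
		step("instances", nil),
		step("security groups", nil),
		step("vpc", nil),
	})

	fmt.Println("phases run:", ran)
	fmt.Println("error:", err)
	fmt.Println("wraps errNLB:", errors.Is(err, errNLB))
}
```

With this shape the caller still learns about the leaked load balancer, and the later phases get a chance to run. Whether that is safe depends on ordering constraints the sketch ignores, for example whether a VPC can be removed while a load balancer still lives in it.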

Comment on lines 104 to 106
if err := p.deleteNLBForCluster(clusterCache); err != nil {
	p.log.Warning("Error deleting load balancer: %v", err)
	return fmt.Errorf("failed to delete load balancer (resources may be leaked): %w", err)
}
Copilot AI Feb 12, 2026

The change to propagate deleteNLBForCluster failures alters delete() control flow; there are existing tests in pkg/provider/aws (including delete_test.go), but no unit test asserting the new behavior (e.g., delete returns an error when NLB deletion fails, and/or that other cleanup phases still run). Please add coverage for this path to prevent regressions.
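
A rough shape for such a test is sketched here. Everything in it is hypothetical: fakeProvider, deleteCluster, the phase names, and the reporter interface are stand-ins, not the types in pkg/provider/aws or delete_test.go. The deleteCluster variant shown defers the NLB error until the other phases have run, which is one of the two behaviors the comment mentions; a test for the merged early-return behavior would assert a different phase list.

```go
package main

import (
	"errors"
	"fmt"
)

// errNLB is a hypothetical sentinel standing in for an ELBv2 failure.
var errNLB = errors.New("nlb deletion failed")

// fakeProvider is a hypothetical test double that records which phases ran.
type fakeProvider struct {
	failNLB bool
	phases  []string
}

func (f *fakeProvider) deleteNLB() error {
	f.phases = append(f.phases, "nlb")
	if f.failNLB {
		return errNLB
	}
	return nil
}

func (f *fakeProvider) deleteRest() {
	f.phases = append(f.phases, "instances", "security-groups", "vpc")
}

// deleteCluster stands in for the function under test. It runs the
// remaining phases before reporting the NLB failure.
func deleteCluster(f *fakeProvider) error {
	nlbErr := f.deleteNLB()
	f.deleteRest()
	if nlbErr != nil {
		return fmt.Errorf("failed to delete load balancer (resources may be leaked): %w", nlbErr)
	}
	return nil
}

// reporter is the subset of *testing.T that the check needs.
type reporter interface {
	Errorf(format string, args ...any)
}

// checkDelete is a table-driven check covering success and NLB failure.
func checkDelete(t reporter) {
	cases := []struct {
		name    string
		failNLB bool
		wantErr bool
	}{
		{name: "nlb deleted", failNLB: false, wantErr: false},
		{name: "nlb deletion fails", failNLB: true, wantErr: true},
	}
	for _, tc := range cases {
		f := &fakeProvider{failNLB: tc.failNLB}
		err := deleteCluster(f)
		if (err != nil) != tc.wantErr {
			t.Errorf("%s: err = %v, wantErr = %v", tc.name, err, tc.wantErr)
		}
		if tc.wantErr && !errors.Is(err, errNLB) {
			t.Errorf("%s: error does not wrap errNLB: %v", tc.name, err)
		}
		if len(f.phases) != 4 {
			t.Errorf("%s: phases run = %v, want all 4", tc.name, f.phases)
		}
	}
}

// recorder collects failures so the sketch can run outside `go test`.
type recorder struct{ failures []string }

func (r *recorder) Errorf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	r := &recorder{}
	checkDelete(r)
	fmt.Println("failures:", len(r.failures))
}
```

In a real `_test.go` file the body of checkDelete would live in a `func TestXxx(t *testing.T)`, since `*testing.T` already satisfies the reporter interface; the recorder exists only so the sketch runs as a plain program.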

Copilot generated this review using guidance from repository custom instructions.
@coveralls
Pull Request Test Coverage Report for Build 21961633716

Details

  • 0 of 4 (0.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.009%) to 47.492%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/provider/aws/delete.go | 0 | 1 | 0.0% |
| pkg/provider/aws/nlb.go | 0 | 3 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 21955389842 | -0.009% |
| Covered Lines | 2500 |
| Relevant Lines | 5264 |

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez merged commit 26a89cd into NVIDIA:main Feb 13, 2026
25 checks passed
3 participants