fix(aws): surface NLB errors instead of swallowing them#645
fix(aws): surface NLB errors instead of swallowing them#645ArangoGutierrez merged 1 commit intoNVIDIA:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves diagnosability and correctness of AWS NLB teardown in Holodeck by ensuring ELBv2 failures aren’t silently ignored during deletion, aligning with the audit findings referenced in the PR description.
Changes:
- Log a warning when
DeregisterTargetsfails instead of discarding the error. - Propagate
deleteNLBForClusterfailures to the caller instead of only warning.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/provider/aws/nlb.go | Surfaces ELBv2 deregistration failures via warning logs during target group deletion. |
| pkg/provider/aws/delete.go | Changes delete flow to return NLB deletion errors to the caller for visibility. |
| if err := p.deleteNLBForCluster(clusterCache); err != nil { | ||
| p.log.Warning("Error deleting load balancer: %v", err) | ||
| return fmt.Errorf("failed to delete load balancer (resources may be leaked): %w", err) | ||
| } |
There was a problem hiding this comment.
Returning immediately on load balancer deletion failure stops the rest of the teardown (EC2 instances, security groups, VPC). This can leak more resources than just the NLB. Consider continuing with phases 1–3 and returning an aggregated error at the end (or track the NLB error and return it after other cleanup completes).
| if err := p.deleteNLBForCluster(clusterCache); err != nil { | ||
| p.log.Warning("Error deleting load balancer: %v", err) | ||
| return fmt.Errorf("failed to delete load balancer (resources may be leaked): %w", err) | ||
| } |
There was a problem hiding this comment.
The change to propagate deleteNLBForCluster failures alters delete() control flow; there are existing tests in pkg/provider/aws (including delete_test.go), but no unit test asserting the new behavior (e.g., delete returns an error when NLB deletion fails, and/or that other cleanup phases still run). Please add coverage for this path to prevent regressions.
Pull Request Test Coverage Report for Build 21961633716Details
💛 - Coveralls |
Summary
pkg/provider/aws/nlb.go:332was fully discarded (_, _ =). Now logs it as a warning so failures are diagnosable.pkg/provider/aws/delete.go:104was only warned about. Now returns the error so callers know about leaked AWS resources and can take action.Addresses audit findings #9 (MEDIUM) and #10 (MEDIUM).
Test plan
gofmt— no formatting issuesgolangci-lint run ./...— 0 issuesgo test ./pkg/...— all tests passgo build -o bin/holodeck cmd/cli/main.go— compilesgo mod tidy && go mod verify— all modules verified