-
Notifications
You must be signed in to change notification settings - Fork 4.1k
roachprod: add support for create/list/destroy load balancer (AWS) #160382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: add support for create/list/destroy load balancer (AWS) #160382
Conversation
6c04951 to
47d6cb7
Compare
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
47d6cb7 to
8ef71b8
Compare
shailendra-patel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shailendra-patel made 2 comments.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @cpj2195 and @herkolategan).
pkg/roachprod/vm/aws/aws.go line 2277 at r2 (raw file):
nlbName := loadBalancerResourceName(clusterName, port, "nlb") // AWS NLB names have a 32-character limit if len(nlbName) > 32 {
nit: Check for length on name is not required, as function loadBalancerResourceName already trim the name if length > 32.
pkg/roachprod/vm/aws/aws.go line 2287 at r2 (raw file):
"--type", "network", "--scheme", "internet-facing", "--subnets",
We should consider adding a security group for NLB, this will control the inbound and outbound traffic on the NLB. I think the default security group available in zone configs should work fine.
--security-groups # aws cli flag
// az, ok := p.Config.AZByName[v.Zone]
// az.Region.SecurityGroup this will give the security group
|
The current implementation registers each EC2 instance with the target groups. This works well for clusters where we do not scale EC2 instances up or down. However, for scale tests, we may want to consider implementing the following as part of separate PRs:
Additionally, in AWS, NLBs are regional. For a multi-region cluster, you will have one NLB per region, each with a different endpoint. To simulate a regional failure, there is no single NLB endpoint that can be used as a pgurl. Therefore, we should also consider using AWS Global Accelerator on top of NLBs to support this requirement in future AWS scale tests. This comment is not a blocker for this PR, in my opinion we need to complete the above item in order to close 153072 fully. |
The support for addition of ASG is already part of this epic here. As for the multi region AWS cluster support, I am not able to provision a multi region roachprod cluster in aws from master as of now. I will add a new ticket to this EPIC for multi region support and try to take it up as part of that.
Will sync with you on this offline |
The VM's already have security groups so any traffic will get filtered down on the VM level. Also we are not really restricting any specific traffic in the zonal config SGs so why to add an additional infrastructure component? |
8ef71b8 to
07d6065
Compare
shailendra-patel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
herkolategan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I have a few comments around ensuring we don't leak resources, and how clean-up is managed.
And then just a general question around how connections are distributed - are they round-robin?
golgeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same general concern @herkolategan already raised: we need to make sure that we clean up properly in case something goes wrong in the multi-step creation/deletion process. It seems like the DeleteLoadBalancer() function will properly destroy elbv2 and target-groups even if one type of resource has already been destroyed (or was never created), but we need to ensure that DeleteLoadBalancer() is properly called when deleting a cluster and when something goes wrong during LB creation.
Something else I'd like to raise: since these are new functions, you should equip them with a context.Context and pass it down the line (use ctxgroup instead of errgroup and call runJSONCommandWithContext()) as this will help in user's cancellations and operations timeout in the future.
One last thing, the pattern of "listing resources then getting elbv2 tags" is repeated multiple times. I wonder if you shouldn't move this logic to a helper function to avoid code duplication.
| "elbv2", "describe-tags", | ||
| "--resource-arns", | ||
| } | ||
| args = append(args, arns...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the awscli doc, it looks like the limit is 20 resources in a single call.
Probably not a huge near term concerns as we don't have a heavy use of LBs, but I think you should consider batching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parking the batching as a TODO for now since we dont use LB's currently too much.
07d6065 to
b3ba7cf
Compare
Its flow hash algorithm as per AWS docs. |
b3ba7cf to
30bdc01
Compare
30bdc01 to
2464a9e
Compare
filter nlbs based on tags rather than names which can be buggy due 32 char limit addressed PR comments around leniency and graceful cleanup of resources implemented concrete deletecluster for aws moved delete logic from DeleteCLuster to Delete
2464a9e to
a7c53b6
Compare
|
bors r+ |
This change adds Network Load Balancer (NLB) support for AWS clusters by
implementing the CreateLoadBalancer, DeleteLoadBalancer, and
ListLoadBalancers methods on the AWS provider.
The implementation:
32 chars due to AWS limits)
Usage
Create a load balancer for SQL connections
roachprod load-balancer create <$cluster> --secure
List load balancers and view connection info
roachprod load-balancer list <$cluster>
roachprod load-balancer pgurl <$cluster>
roachprod load-balancer ip <$cluster>
Get connection URL through load balancer
roachprod fetch-certs <$cluster>
Connect to cluster node via load balancer
eval ./cockroach sql --url=$(roachprod load-balancer pgurl <$cluster> --secure)
Delete all load balancers
roachprod load-balancer destroy <$cluster>
Fixes: #54176
Epic: None
Release Note: None