@cpj2195
Contributor

@cpj2195 cpj2195 commented Jan 2, 2026

This change adds Network Load Balancer (NLB) support for AWS clusters by
implementing the CreateLoadBalancer, DeleteLoadBalancer, and
ListLoadBalancers methods on the AWS provider.

The implementation:

  • Creates a regional NLB with a TCP listener on the specified port
  • Creates a target group with health checks and registers all cluster VMs
  • Uses a naming convention {cluster}-{port}-{type}-roachprod (truncated to
    32 chars due to AWS limits)
  • Handles multi-region clusters by creating an NLB in each region
  • Cleans up all associated resources (listeners, target groups) on deletion
  • Supports deleting all load balancers for a cluster
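
The naming convention above can be sketched as follows. This is a minimal illustration, not the PR's loadBalancerResourceName: the helper name resourceName and the plain byte truncation are assumptions. Truncation means two long cluster names can map to the same resource name.

```go
package main

import "fmt"

// maxNLBNameLen is the 32-character limit on AWS load balancer names.
const maxNLBNameLen = 32

// resourceName is a hypothetical helper (not the PR's implementation) that
// builds "{cluster}-{port}-{type}-roachprod" and truncates it to 32 bytes.
// It assumes ASCII names, so slicing by byte is safe.
func resourceName(cluster string, port int, resourceType string) string {
	name := fmt.Sprintf("%s-%d-%s-roachprod", cluster, port, resourceType)
	if len(name) > maxNLBNameLen {
		name = name[:maxNLBNameLen]
	}
	return name
}

func main() {
	fmt.Println(resourceName("demo", 26257, "nlb"))
	fmt.Println(resourceName("a-very-long-cluster-name-for-testing", 26257, "nlb"))
}
```

Because the suffix is the first thing lost when a name is truncated, lookups that rely on the name alone become unreliable for long cluster names.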

Usage

Create a load balancer for SQL connections

roachprod load-balancer create <$cluster> --secure

List load balancers and view connection info

roachprod load-balancer list <$cluster>
roachprod load-balancer pgurl <$cluster>
roachprod load-balancer ip <$cluster>

Get connection URL through load balancer

roachprod fetch-certs <$cluster>

Connect to cluster node via load balancer

eval ./cockroach sql --url=$(roachprod load-balancer pgurl <$cluster> --secure)

Delete all load balancers

roachprod load-balancer destroy <$cluster>

Fixes: #54176
Epic: None
Release Note: None

@cockroach-teamcity
Member

This change is Reviewable

@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 6c04951 to 47d6cb7 Compare January 5, 2026 07:07
@cpj2195 cpj2195 changed the title initial commit roachprod: add support for create/list/destroy load balancer (AWS) Jan 5, 2026
@cpj2195 cpj2195 self-assigned this Jan 5, 2026
@cpj2195 cpj2195 marked this pull request as ready for review January 5, 2026 09:02
@cpj2195 cpj2195 requested a review from a team as a code owner January 5, 2026 09:02
@cpj2195 cpj2195 requested review from herkolategan and shailendra-patel and removed request for a team January 5, 2026 09:02
@github-actions

github-actions bot commented Jan 5, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Jan 5, 2026
@cpj2195 cpj2195 marked this pull request as draft January 5, 2026 09:17
@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 47d6cb7 to 8ef71b8 Compare January 5, 2026 13:23
@cpj2195 cpj2195 marked this pull request as ready for review January 5, 2026 14:59
Contributor

@shailendra-patel shailendra-patel left a comment

@shailendra-patel made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cpj2195 and @herkolategan).


pkg/roachprod/vm/aws/aws.go line 2277 at r2 (raw file):

	nlbName := loadBalancerResourceName(clusterName, port, "nlb")
	// AWS NLB names have a 32-character limit
	if len(nlbName) > 32 {

nit: The length check on the name is not required, as loadBalancerResourceName already trims the name if its length exceeds 32.


pkg/roachprod/vm/aws/aws.go line 2287 at r2 (raw file):

		"--type", "network",
		"--scheme", "internet-facing",
		"--subnets",

We should consider adding a security group for the NLB; this will control the inbound and outbound traffic on the NLB. I think the default security group available in the zone configs should work fine.

--security-groups # aws cli flag 
// az, ok := p.Config.AZByName[v.Zone]
// az.Region.SecurityGroup this will give the security group
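
To make the suggestion concrete, here is a hedged sketch of how the argument list could grow an optional --security-groups flag. The helper name createNLBArgs and the example IDs are hypothetical; the flags themselves are those of `aws elbv2 create-load-balancer`.

```go
package main

import "fmt"

// createNLBArgs is a hypothetical helper that assembles arguments for
// `aws elbv2 create-load-balancer`. When securityGroup is empty, the
// --security-groups flag is omitted entirely.
func createNLBArgs(name string, subnets []string, securityGroup string) []string {
	args := []string{
		"elbv2", "create-load-balancer",
		"--name", name,
		"--type", "network",
		"--scheme", "internet-facing",
		"--subnets",
	}
	args = append(args, subnets...)
	if securityGroup != "" {
		args = append(args, "--security-groups", securityGroup)
	}
	return args
}

func main() {
	// Subnet and security group IDs below are placeholders.
	fmt.Println(createNLBArgs("demo-26257-nlb-roachprod",
		[]string{"subnet-a", "subnet-b"}, "sg-123"))
}
```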

@shailendra-patel
Contributor

The current implementation registers each EC2 instance with the target groups. This works well for clusters where we do not scale EC2 instances up or down. However, for scale tests, we may want to consider implementing the following as part of separate PRs:

  1. Support AWS Auto Scaling Groups for creating and managing instances.
  2. Add support for AWS in the roachprod grow command.
  3. Create AWS NLBs with Auto Scaling Groups as target groups.

Additionally, in AWS, NLBs are regional. For a multi-region cluster, you will have one NLB per region, each with a different endpoint. To simulate a regional failure, there is no single NLB endpoint that can be used as a pgurl. Therefore, we should also consider using AWS Global Accelerator on top of NLBs to support this requirement in future AWS scale tests.

This comment is not a blocker for this PR, but in my opinion we need to complete the above items in order to fully close 153072.

@cpj2195
Contributor Author

cpj2195 commented Jan 6, 2026

The current implementation registers each EC2 instance with the target groups. This works well for clusters where we do not scale EC2 instances up or down. However, for scale tests, we may want to consider implementing the following as part of separate PRs:

  1. Support AWS Auto Scaling Groups for creating and managing instances.
  3. Create AWS NLBs with Auto Scaling Groups as target groups.

Additionally, in AWS, NLBs are regional. For a multi-region cluster, you will have one NLB per region, each with a different endpoint. To simulate a regional failure, there is no single NLB endpoint that can be used as a pgurl. Therefore, we should also consider using AWS Global Accelerator on top of NLBs to support this requirement in future AWS scale tests.

This comment is not a blocker for this PR, but in my opinion we need to complete the above items in order to fully close 153072.

Support for ASGs is already part of this epic here. As for multi-region AWS cluster support, I am currently unable to provision a multi-region roachprod cluster in AWS from master. I will add a new ticket to this epic for multi-region support and try to take it up as part of that.

  2. Add support for AWS in the roachprod grow command.

Will sync with you on this offline

@cpj2195
Contributor Author

cpj2195 commented Jan 7, 2026

We should consider adding a security group for the NLB; this will control the inbound and outbound traffic on the NLB. I think the default security group available in the zone configs should work fine.

The VMs already have security groups, so traffic is filtered at the VM level. Also, the zonal config SGs do not really restrict any specific traffic, so why add another infrastructure component?
What do you think?

@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 8ef71b8 to 07d6065 Compare January 7, 2026 05:42
Contributor

@shailendra-patel shailendra-patel left a comment

LGTM

@srosenberg srosenberg requested a review from golgeek January 8, 2026 03:07
Collaborator

@herkolategan herkolategan left a comment

Nice work! I have a few comments around ensuring we don't leak resources, and how clean-up is managed.

And then just a general question around how connections are distributed - are they round-robin?

Contributor

@golgeek golgeek left a comment

I have the same general concern @herkolategan already raised: we need to make sure that we clean up properly in case something goes wrong in the multi-step creation/deletion process. It seems like the DeleteLoadBalancer() function will properly destroy elbv2 and target-groups even if one type of resource has already been destroyed (or was never created), but we need to ensure that DeleteLoadBalancer() is properly called when deleting a cluster and when something goes wrong during LB creation.
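
One way to picture the cleanup concern is a generic rollback pattern: each creation step carries its own undo, and a failure undoes the completed steps in reverse order. The sketch below is illustrative only; the step type, runWithRollback, and errSimulated are hypothetical names and not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for a failed AWS call in this sketch.
var errSimulated = errors.New("simulated failure")

// step pairs a creation action with the action that undoes it.
type step struct {
	name   string
	create func() error
	undo   func() error
}

// runWithRollback runs steps in order. If one fails, it undoes the steps that
// already succeeded, in reverse order, and returns the combined error.
func runWithRollback(steps []step) error {
	for i, s := range steps {
		if err := s.create(); err != nil {
			err = fmt.Errorf("%s: %w", s.name, err)
			for j := i - 1; j >= 0; j-- {
				if undoErr := steps[j].undo(); undoErr != nil {
					err = errors.Join(err, fmt.Errorf("undo %s: %w", steps[j].name, undoErr))
				}
			}
			return err
		}
	}
	return nil
}

func main() {
	noop := func() error { return nil }
	undo := func(name string) func() error {
		return func() error { fmt.Println("undo", name); return nil }
	}
	err := runWithRollback([]step{
		{name: "target-group", create: noop, undo: undo("target-group")},
		{name: "load-balancer", create: noop, undo: undo("load-balancer")},
		{name: "listener", create: func() error { return errSimulated }, undo: undo("listener")},
	})
	fmt.Println("error:", err)
}
```

In the PR's terms, the undo path would presumably be a call to DeleteLoadBalancer(), which is described above as tolerating resources that are already gone.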

Something else I'd like to raise: since these are new functions, they should take a context.Context and pass it down the call chain (use ctxgroup instead of errgroup, and call runJSONCommandWithContext()). This will help with user cancellation and operation timeouts in the future.

One last thing, the pattern of "listing resources then getting elbv2 tags" is repeated multiple times. I wonder if you shouldn't move this logic to a helper function to avoid code duplication.

"elbv2", "describe-tags",
"--resource-arns",
}
args = append(args, arns...)
Contributor

From the awscli docs, it looks like the limit is 20 resources in a single call.
This is probably not a big near-term concern, as we don't make heavy use of LBs, but I think you should consider batching.
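
A minimal sketch of what batching could look like, assuming the 20-resource limit mentioned above. The helper name batchARNs and the placeholder ARNs are hypothetical.

```go
package main

import "fmt"

// describeTagsBatchSize is the per-call resource limit mentioned in the review.
const describeTagsBatchSize = 20

// batchARNs splits arns into consecutive slices of at most size elements.
func batchARNs(arns []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(arns); start += size {
		end := start + size
		if end > len(arns) {
			end = len(arns)
		}
		batches = append(batches, arns[start:end])
	}
	return batches
}

func main() {
	// Placeholder ARNs, for illustration only.
	arns := make([]string, 45)
	for i := range arns {
		arns[i] = fmt.Sprintf("arn:aws:elasticloadbalancing:example:%d", i)
	}
	for _, batch := range batchARNs(arns, describeTagsBatchSize) {
		// One describe-tags invocation per batch.
		args := append([]string{"elbv2", "describe-tags", "--resource-arns"}, batch...)
		fmt.Println("resources in this call:", len(args)-3)
	}
}
```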

Contributor Author

Parking the batching as a TODO for now, since we don't use LBs much currently.

@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 07d6065 to b3ba7cf Compare January 12, 2026 08:55
@cpj2195
Contributor Author

cpj2195 commented Jan 12, 2026

And then just a general question around how connections are distributed - are they round-robin?

It uses a flow hash algorithm, as per the AWS docs, so it is not round-robin.
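
To illustrate the idea (not AWS's actual algorithm), here is a simplified sketch: the connection's identifying fields are hashed, and the hash picks a target, so every packet of one connection lands on the same node. The flow struct, pickTarget, the use of FNV, and the set of hashed fields are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow identifies a TCP connection. This is a simplification; the fields a
// real NLB hashes are defined by AWS, not by this sketch.
type flow struct {
	srcIP   string
	srcPort int
	dstIP   string
	dstPort int
}

// pickTarget hashes the flow and maps the hash onto one of the targets.
// It returns "" when there are no targets.
func pickTarget(f flow, targets []string) string {
	if len(targets) == 0 {
		return ""
	}
	h := fnv.New32a()
	fmt.Fprintf(h, "tcp|%s|%d|%s|%d", f.srcIP, f.srcPort, f.dstIP, f.dstPort)
	return targets[h.Sum32()%uint32(len(targets))]
}

func main() {
	targets := []string{"node-1", "node-2", "node-3"}
	f := flow{srcIP: "10.0.0.5", srcPort: 50000, dstIP: "10.0.1.9", dstPort: 26257}
	// The same flow always maps to the same target.
	fmt.Println(pickTarget(f, targets) == pickTarget(f, targets))
}
```

The practical consequence is that load is spread per connection rather than per request, so a few long-lived SQL connections may not be evenly distributed across nodes.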

@cpj2195 cpj2195 requested a review from herkolategan January 12, 2026 08:57
@cpj2195 cpj2195 requested a review from golgeek January 14, 2026 07:58
@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from b3ba7cf to 30bdc01 Compare January 16, 2026 05:29
@cpj2195 cpj2195 requested a review from herkolategan January 16, 2026 05:29
@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 30bdc01 to 2464a9e Compare January 19, 2026 03:53
filter NLBs based on tags rather than names, which can be buggy due to the 32-char limit

addressed PR comments around leniency and graceful cleanup of resources

implemented concrete DeleteCluster for AWS

moved delete logic from DeleteCluster to Delete
@cpj2195 cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 2464a9e to a7c53b6 Compare January 20, 2026 07:54
@cpj2195 cpj2195 requested a review from golgeek January 20, 2026 11:35
@cpj2195
Contributor Author

cpj2195 commented Jan 20, 2026

bors r+

@craig
Contributor

craig bot commented Jan 20, 2026

@craig craig bot merged commit a58ac7e into cockroachdb:master Jan 20, 2026
37 of 38 checks passed

Labels

o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. target-release-26.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachprod: add load balancer support (AWS)

5 participants