roachprod: add support for create/list/destroy load balancer (AWS) #160382

cpj2195 · 2026-01-02T10:18:54Z

This change adds Network Load Balancer (NLB) support for AWS clusters by
implementing the CreateLoadBalancer, DeleteLoadBalancer, and
ListLoadBalancers methods on the AWS provider.

The implementation:

- Creates a regional NLB with a TCP listener on the specified port
- Creates a target group with health checks and registers all cluster VMs
- Uses the naming convention {cluster}-{port}-{type}-roachprod (truncated to 32 characters due to AWS limits)
- Handles multi-region clusters by creating an NLB in each region
- Cleans up all associated resources (listeners, target groups) on deletion
- Supports deleting all load balancers for a cluster

Usage

Create a load balancer for SQL connections

roachprod load-balancer create <$cluster> --secure

List load balancers and view connection info

roachprod load-balancer list <$cluster>
roachprod load-balancer pgurl <$cluster>
roachprod load-balancer ip <$cluster>

Get connection URL through load balancer

roachprod fetch-certs <$cluster>

Connect to cluster node via load balancer

eval ./cockroach sql --url=$(roachprod load-balancer pgurl <$cluster> --secure)

Delete all load balancers

roachprod load-balancer destroy <$cluster>

Fixes: #54176
Epic: None
Release Note: None

github-actions · 2026-01-05T09:06:43Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

shailendra-patel

@shailendra-patel made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cpj2195 and @herkolategan).

pkg/roachprod/vm/aws/aws.go line 2277 at r2 (raw file):

	nlbName := loadBalancerResourceName(clusterName, port, "nlb")
	// AWS NLB names have a 32-character limit
	if len(nlbName) > 32 {

nit: The length check on the name is not required, as loadBalancerResourceName already trims the name if its length is > 32.

pkg/roachprod/vm/aws/aws.go line 2287 at r2 (raw file):

		"--type", "network",
		"--scheme", "internet-facing",
		"--subnets",

We should consider adding a security group for the NLB; this would control the inbound and outbound traffic on the NLB. I think the default security group available in the zone configs should work fine.

--security-groups # aws cli flag 
// az, ok := p.Config.AZByName[v.Zone]
// az.Region.SecurityGroup this will give the security group

shailendra-patel · 2026-01-06T06:29:54Z

The current implementation registers each EC2 instance with the target groups. This works well for clusters where we do not scale EC2 instances up or down. However, for scale tests, we may want to consider implementing the following as part of separate PRs:

Support AWS Auto Scaling Groups for creating and managing instances.
Add support for AWS in the roachprod grow command.
Create AWS NLBs with Auto Scaling Groups as target groups.

Additionally, in AWS, NLBs are regional. For a multi-region cluster, you will have one NLB per region, each with a different endpoint. To simulate a regional failure, there is no single NLB endpoint that can be used as a pgurl. Therefore, we should also consider using AWS Global Accelerator on top of NLBs to support this requirement in future AWS scale tests.

This comment is not a blocker for this PR. In my opinion, we need to complete the above items in order to close 153072 fully.

cpj2195 · 2026-01-06T13:32:32Z

> The current implementation registers each EC2 instance with the target groups. This works well for clusters where we do not scale EC2 instances up or down. However, for scale tests, we may want to consider implementing the following as part of separate PRs:
>
> Support AWS Auto Scaling Groups for creating and managing instances.
>
> Create AWS NLBs with Auto Scaling Groups as target groups.
>
> Additionally, in AWS, NLBs are regional. For a multi-region cluster, you will have one NLB per region, each with a different endpoint. To simulate a regional failure, there is no single NLB endpoint that can be used as a pgurl. Therefore, we should also consider using AWS Global Accelerator on top of NLBs to support this requirement in future AWS scale tests.
>
> This comment is not a blocker for this PR. In my opinion, we need to complete the above items in order to close 153072 fully.

Support for adding ASGs is already part of this epic. As for multi-region AWS cluster support, I am currently unable to provision a multi-region roachprod cluster in AWS from master. I will add a new ticket to this epic for multi-region support and try to take it up as part of that.

> Add support for AWS in the roachprod grow command.

Will sync with you on this offline.

cpj2195 · 2026-01-07T05:37:09Z

> We should consider adding a security group for the NLB; this would control the inbound and outbound traffic on the NLB. I think the default security group available in the zone configs should work fine.

The VMs already have security groups, so any traffic will be filtered at the VM level. Also, we are not really restricting any specific traffic in the zonal config SGs, so why add an additional infrastructure component?
What do you think?

shailendra-patel

LGTM

herkolategan

Nice work! I have a few comments around ensuring we don't leak resources, and how clean-up is managed.

And then just a general question around how connections are distributed - are they round-robin?

pkg/roachprod/vm/aws/aws.go

golgeek

I have the same general concern @herkolategan already raised: we need to make sure that we clean up properly in case something goes wrong in the multi-step creation/deletion process. It seems like the DeleteLoadBalancer() function will properly destroy elbv2 and target-groups even if one type of resource has already been destroyed (or was never created), but we need to ensure that DeleteLoadBalancer() is properly called when deleting a cluster and when something goes wrong during LB creation.

Something else I'd like to raise: since these are new functions, you should equip them with a context.Context and pass it down the line (use ctxgroup instead of errgroup and call runJSONCommandWithContext()), as this will help with user cancellations and operation timeouts in the future.

One last thing, the pattern of "listing resources then getting elbv2 tags" is repeated multiple times. I wonder if you shouldn't move this logic to a helper function to avoid code duplication.

golgeek · 2026-01-08T15:45:04Z

pkg/roachprod/vm/aws/aws.go

+		"elbv2", "describe-tags",
+		"--resource-arns",
+	}
+	args = append(args, arns...)


From the awscli doc, it looks like the limit is 20 resources in a single call.
Probably not a huge near-term concern, as we don't make heavy use of LBs, but I think you should consider batching.

Parking the batching as a TODO for now, since we don't currently use LBs much.
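
For reference, the batching parked as a TODO could look roughly like the sketch below. The helper names `batchARNs` and `describeTagsArgs` are hypothetical, and the limit of 20 is taken from the review comment above rather than verified here.

```go
package main

// maxARNsPerCall is the per-call limit cited in the review
// (assumed to be 20 for elbv2 describe-tags).
const maxARNsPerCall = 20

// batchARNs splits arns into consecutive slices of at most size elements.
func batchARNs(arns []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(arns); start += size {
		end := start + size
		if end > len(arns) {
			end = len(arns)
		}
		batches = append(batches, arns[start:end])
	}
	return batches
}

// describeTagsArgs builds the CLI arguments for one batch, mirroring the
// argument list shown in the diff above.
func describeTagsArgs(batch []string) []string {
	args := []string{"elbv2", "describe-tags", "--resource-arns"}
	return append(args, batch...)
}
```

A caller would loop over `batchARNs(arns, maxARNsPerCall)`, run one describe-tags call per batch, and merge the results.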

cpj2195 · 2026-01-12T08:56:45Z

> And then just a general question around how connections are distributed - are they round-robin?

It's the flow hash algorithm, as per the AWS docs.
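
As a rough illustration of what flow hashing means for connection distribution: the target is derived from the connection's addressing fields, so every packet of one TCP connection lands on the same node, while different connections spread across nodes. The sketch below is not AWS's implementation; the `flow` type, the `pickTarget` function, and the use of FNV-1a are assumptions, and it omits the TCP sequence number that the AWS documentation also lists as a hash input.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow holds the addressing fields of one TCP connection.
// (Hypothetical type, for illustration only.)
type flow struct {
	proto   string
	srcIP   string
	srcPort int
	dstIP   string
	dstPort int
}

// pickTarget maps a flow to one of the targets by hashing its fields.
// The same flow always maps to the same target for a fixed target list.
func pickTarget(f flow, targets []string) string {
	if len(targets) == 0 {
		return ""
	}
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%s|%d", f.proto, f.srcIP, f.srcPort, f.dstIP, f.dstPort)
	return targets[h.Sum32()%uint32(len(targets))]
}
```

Unlike round-robin, this gives no guarantee of an even split for a small number of connections; a client that opens many connections from different source ports will tend to spread out, while a single long-lived connection stays on one node.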

pkg/roachprod/cloud/cluster_cloud.go

pkg/roachprod/vm/aws/aws.go

Filter NLBs based on tags rather than names, since name matching can be buggy due to the 32-character limit. Addressed PR comments around leniency and graceful cleanup of resources. Implemented a concrete DeleteCluster for AWS. Moved the delete logic from DeleteCluster to Delete.
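
A sketch of tag-based filtering, under stated assumptions: the `loadBalancer` type, the `filterByClusterTag` function, and the tag key passed by the caller are hypothetical, since the actual tag names used by the PR are not shown in this conversation.

```go
package main

// loadBalancer pairs a load balancer ARN with its tags, as they might be
// assembled from the describe-load-balancers and describe-tags output.
// (Hypothetical type, for illustration only.)
type loadBalancer struct {
	ARN  string
	Tags map[string]string
}

// filterByClusterTag returns the load balancers whose tag tagKey equals
// cluster. Tag values are not truncated to 32 characters the way names
// are, so long cluster names still match exactly.
func filterByClusterTag(lbs []loadBalancer, tagKey, cluster string) []loadBalancer {
	var out []loadBalancer
	for _, lb := range lbs {
		if lb.Tags[tagKey] == cluster {
			out = append(out, lb)
		}
	}
	return out
}
```

Load balancers without the tag are skipped, because a lookup on a missing key returns the empty string.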

cpj2195 · 2026-01-20T15:05:25Z

bors r+

craig · 2026-01-20T15:36:40Z

Build succeeded:

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 6c04951 to 47d6cb7 Compare January 5, 2026 07:07

cpj2195 changed the title ~~initial commit~~ roachprod: add support for create/list/destroy load balancer (AWS) Jan 5, 2026

cpj2195 self-assigned this Jan 5, 2026

cpj2195 marked this pull request as ready for review January 5, 2026 09:02

cpj2195 requested a review from a team as a code owner January 5, 2026 09:02

cpj2195 requested review from herkolategan and shailendra-patel and removed request for a team January 5, 2026 09:02

github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Jan 5, 2026

cpj2195 marked this pull request as draft January 5, 2026 09:17

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 47d6cb7 to 8ef71b8 Compare January 5, 2026 13:23

cpj2195 marked this pull request as ready for review January 5, 2026 14:59

shailendra-patel reviewed Jan 6, 2026

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 8ef71b8 to 07d6065 Compare January 7, 2026 05:42

shailendra-patel approved these changes Jan 7, 2026

srosenberg requested a review from golgeek January 8, 2026 03:07

herkolategan reviewed Jan 8, 2026

golgeek reviewed Jan 8, 2026

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 07d6065 to b3ba7cf Compare January 12, 2026 08:55

cpj2195 requested a review from herkolategan January 12, 2026 08:57

herkolategan reviewed Jan 13, 2026

cpj2195 requested a review from golgeek January 14, 2026 07:58

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from b3ba7cf to 30bdc01 Compare January 16, 2026 05:29

cpj2195 requested a review from herkolategan January 16, 2026 05:29

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 30bdc01 to 2464a9e Compare January 19, 2026 03:53

golgeek reviewed Jan 19, 2026

cpj2195 force-pushed the roachprod/add_loadbalancer_to_AWS branch from 2464a9e to a7c53b6 Compare January 20, 2026 07:54

cpj2195 requested a review from golgeek January 20, 2026 11:35

golgeek approved these changes Jan 20, 2026

craig bot merged commit a58ac7e into cockroachdb:master Jan 20, 2026
37 of 38 checks passed

celeste-cockroachdb bot added the target-release-26.2.0 label Jan 20, 2026
