
Feat/Kubernetes Private Cluster #1300

Draft

asadaaron wants to merge 3 commits into main from feat/kubernete-private-cluster

Conversation

@asadaaron
Collaborator

This PR introduces changes from the feat/kubernete-private-cluster branch.

📝 Summary

📁 Files Changed (27 files)

.gitignore
terraform/infrastructure/envs/dev/main.tf
terraform/infrastructure/envs/dev/outputs.tf
terraform/infrastructure/envs/dev/variables.tf
terraform/infrastructure/envs/prd/main.tf
terraform/infrastructure/envs/prd/outputs.tf
terraform/infrastructure/envs/prd/variables.tf
terraform/infrastructure/envs/stg/main.tf
terraform/infrastructure/envs/stg/outputs.tf
terraform/infrastructure/envs/stg/variables.tf
terraform/infrastructure/main.tf
terraform/infrastructure/modules/kubernetes/azure/main.tf
terraform/infrastructure/modules/kubernetes/azure/outputs.tf
terraform/infrastructure/modules/kubernetes/azure/variables.tf
terraform/infrastructure/modules/kubernetes/gcp/cluster.tf
terraform/infrastructure/modules/kubernetes/gcp/firewall.tf
terraform/infrastructure/modules/kubernetes/gcp/main.tf
terraform/infrastructure/modules/kubernetes/gcp/node_pool.tf
terraform/infrastructure/modules/kubernetes/gcp/outputs.tf
terraform/infrastructure/modules/kubernetes/gcp/service_account.tf
terraform/infrastructure/modules/kubernetes/gcp/variables.tf
terraform/infrastructure/modules/network/gcp/firewall.tf
terraform/infrastructure/modules/network/gcp/main.tf
terraform/infrastructure/modules/network/gcp/nat.tf
terraform/infrastructure/modules/network/gcp/outputs.tf
terraform/infrastructure/modules/network/gcp/subnets.tf
terraform/infrastructure/outputs.tf

📋 Commit Details

bf3eeebf - fix: include tfplan into the gitignore. (Md Asaduzzaman Miah, 2026-02-10 13:26)
7d808e71 - fix: All cluster resources were brought up, but: only 0 nodes out of 3 have registered. (Md Asaduzzaman Miah, 2026-02-10 10:56)
338d9b44 - kubernetes private cluster (Md Asaduzzaman Miah, 2026-02-09 16:43)

✅ Checklist

  • Code follows the project's style guidelines
  • Self-review of code has been performed
  • Code is commented, particularly in hard-to-understand areas
  • Corresponding changes to documentation have been made
  • Tests have been added/updated for new functionality
  • All tests pass locally

🧪 Testing

📸 Screenshots (if applicable)

🔗 Related Issues

asadaaron self-assigned this Feb 10, 2026
@harry-rhesis
Contributor

Overall Review

Great work on this PR, Asad! Setting up private GKE clusters with proper network isolation is non-trivial infrastructure work, and the foundations here are solid. The module structure is clean and well-organized (cluster.tf, node_pool.tf, firewall.tf, service_account.tf -- nice separation of concerns), the IP addressing scheme is well-planned with no CIDR overlaps, and I appreciate the thoughtful decisions like Workload Identity, shielded nodes, appropriate machine sizing per environment, and deletion_protection = true on prd.

The Cloud NAT addition and the correct removal of the master subnet (with the explanatory comment) show a good understanding of how GKE private clusters interact with the network layer.

I've left a few prioritized comments below -- some are things that could bite us in production if not addressed now, others are hardening suggestions and minor cleanups. Nothing here takes away from the quality of the overall design.

@harry-rhesis
Contributor

P0 — Standalone env deploys create unreachable GKE clusters

Files: envs/dev/main.tf, envs/stg/main.tf, envs/prd/main.tf

The env configs say "Standalone ... network (no peering). For full deploy with peerings run from infrastructure/" — but they now include the GKE module with enable_private_endpoint = true and only the WireGuard CIDR as an authorized network.

In a standalone deploy (no peering to the WireGuard VPC):

  • There is no route from WireGuard to this VPC
  • The only authorized network is the WireGuard CIDR
  • There is no public API endpoint (enable_private_endpoint = true)

This creates a cluster that nobody can reach — not via VPN (no peering), not via public internet (private only). The cluster becomes unmanageable immediately after creation.

Suggested fix: Either:

  1. Don't include the GKE module in standalone env configs (keep GKE only in root main.tf where peerings exist), or
  2. Add the node subnet CIDR as an additional authorized network so at least in-VPC access works, or
  3. Add a variable like enable_public_endpoint that defaults to true in standalone mode and false in root mode
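For option 3, a minimal sketch inside the GKE module (var.master_cidr already exists per the firewall rules; the new variable name follows the suggestion above, everything else is illustrative):

variable "enable_public_endpoint" {
  description = "Expose a public API endpoint (still gated by authorized networks). Intended for standalone env deploys."
  type        = bool
  default     = false
}

# cluster.tf
private_cluster_config {
  enable_private_nodes    = true
  enable_private_endpoint = !var.enable_public_endpoint
  master_ipv4_cidr_block  = var.master_cidr
}

The root main.tf would keep the default (fully private), while the standalone envs/*/main.tf would set enable_public_endpoint = true.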

@harry-rhesis
Contributor

P0 — VPC peering missing export_custom_routes (kubectl via VPN won't work)

File: terraform/infrastructure/main.tf (peering resources)

When GKE creates a private cluster, it peers a Google-managed VPC (containing the master) into the env VPC and imports the route automatically. However, for WireGuard clients to reach the master, that route needs to propagate through the WireGuard ↔ env peering.

The current peerings are bare:

resource "google_compute_network_peering" "dev_to_wireguard" {
  name         = "peering-dev-to-wireguard"
  network      = module.dev.vpc_self_link
  peer_network = module.wireguard.vpc_self_link
}

Without custom route exchange, the WireGuard VPC has no route to the GKE master private IP. kubectl over VPN will fail with connection timeouts.

Suggested fix: Add to each env→wireguard peering:

export_custom_routes = true

And to each wireguard→env peering:

import_custom_routes = true
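Put together, the dev pair would look roughly like this (the reverse-direction resource name is assumed; stg and prd follow the same pattern):

resource "google_compute_network_peering" "dev_to_wireguard" {
  name                 = "peering-dev-to-wireguard"
  network              = module.dev.vpc_self_link
  peer_network         = module.wireguard.vpc_self_link
  export_custom_routes = true   # export custom routes toward the WireGuard VPC
}

resource "google_compute_network_peering" "wireguard_to_dev" {
  name                 = "peering-wireguard-to-dev"
  network              = module.wireguard.vpc_self_link
  peer_network         = module.dev.vpc_self_link
  import_custom_routes = true   # accept those routes on the WireGuard side
}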

@harry-rhesis
Contributor

P1 — Firewall rules gke_nodes_to_master and gke_wireguard_to_master are no-ops

File: terraform/infrastructure/modules/kubernetes/gcp/firewall.tf

These two rules are INGRESS rules (the default direction) with destination_ranges pointing at the master CIDR:

resource "google_compute_firewall" "gke_nodes_to_master" {
  ...
  source_ranges      = [var.node_cidr, var.pod_cidr]
  destination_ranges = [var.master_cidr]
}

The GKE master lives in a Google-managed peered VPC, not in your VPC. An ingress rule in your VPC with destination_ranges = master_cidr won't match any real traffic, because no packets arriving at VMs in your VPC will have a destination IP in the master range. These rules create a false sense of coverage but do nothing.

The actual control-plane access works because:

  • master_authorized_networks_config handles API authorization (correctly set)
  • GKE automatically manages firewall rules on the peered connection
  • Egress from nodes to master just works (since deny-all egress was removed)

Suggested fix: Remove these two rules entirely. They can't control traffic to a resource in another VPC. If you want to document the intent, a comment explaining that GKE manages control-plane connectivity would be clearer.
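If a breadcrumb is wanted in firewall.tf after removing them, something along these lines (wording is only a suggestion):

# NOTE: No rules for the GKE control plane are defined here on purpose. The master
# lives in a Google-managed peered VPC; API access is restricted via
# master_authorized_networks_config, and GKE manages the firewall rules required
# for master <-> node traffic automatically. An ingress rule in this VPC with the
# master CIDR as destination would never match real traffic.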

@harry-rhesis
Contributor

P1 — Deny-all egress removed without replacement + gke_internal too broad

Files: modules/network/gcp/firewall.tf, modules/kubernetes/gcp/firewall.tf

Two related firewall concerns:

1. Egress is now fully open. The deny-all egress rule was removed (with a helpful comment explaining why), and the comment says "Use specific allow/deny rules in the GKE module for fine-grained control." However, the GKE module doesn't define any egress rules. All egress is now GCP-default-allowed (0.0.0.0/0). This is a meaningful security posture change from the original baseline.

Suggested fix: Add specific egress allow rules in the GKE module for what nodes actually need (Google APIs via restricted VIP 199.36.153.4/30, control plane CIDR, Cloud NAT gateway) and keep a deny-all egress baseline at a lower priority. Or, if the intent is to rely on Cloud NAT + Private Google Access for now, document that explicitly.

2. gke_internal has no target_tags. This rule allows all TCP/UDP/ICMP from cluster CIDRs to every instance in the VPC, not just GKE nodes:

source_ranges = [var.node_cidr, var.pod_cidr, var.service_cidr]
# no target_tags, so this applies to every instance in the VPC

Suggested fix: Add target_tags = ["gke-${var.environment}"] to scope this to GKE node instances only.
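To make the egress suggestion in item 1 concrete, a partial sketch for the GKE module (var.network is an assumed variable name; further allow rules from the list above, e.g. the control-plane CIDR, would be needed before the deny baseline is safe to apply):

resource "google_compute_firewall" "gke_allow_google_apis_egress" {
  name      = "gke-${var.environment}-allow-google-apis-egress"
  network   = var.network
  direction = "EGRESS"
  priority  = 900

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  destination_ranges = ["199.36.153.4/30"]  # restricted.googleapis.com VIP
  target_tags        = ["gke-${var.environment}"]
}

resource "google_compute_firewall" "gke_deny_all_egress" {
  name      = "gke-${var.environment}-deny-all-egress"
  network   = var.network
  direction = "EGRESS"
  priority  = 65000   # evaluated after the specific allows above

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["gke-${var.environment}"]
}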

@harry-rhesis
Contributor

P2 — Local state backend for production GKE

Files: infrastructure/main.tf, all envs/*/main.tf

All configs use backend "local" {}. Now that we're managing GKE clusters (especially prd with deletion_protection = true), local state brings real risk:

  • No state locking — concurrent terraform apply runs can corrupt state
  • State lives only on the machine that ran apply — if a laptop is lost, so is the state
  • No shared visibility — teammates can't see or plan against current infra
  • Orphaned resources if state is lost (GKE clusters, service accounts, IAM bindings)

Suggestion: This doesn't need to block the PR, but should be a fast follow-up before any production apply. A GCS bucket with versioning + state locking is the standard approach:

backend "gcs" {
  bucket = "rhesis-terraform-state"
  prefix = "infrastructure/dev"  # per-env prefix
}
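For reference, the backend block sits inside the top-level terraform block, and the existing local state can be moved over with terraform init -migrate-state once the bucket exists:

terraform {
  backend "gcs" {
    bucket = "rhesis-terraform-state"
    prefix = "infrastructure/dev"  # per-env prefix, e.g. infrastructure/stg, infrastructure/prd
  }
}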

@harry-rhesis
Contributor

P2 — No maintenance window or master_global_access_config

File: modules/kubernetes/gcp/cluster.tf

Two GKE cluster settings worth adding:

1. No maintenance_policy: Without this, GKE will auto-upgrade nodes and masters at any time. For staging and especially production, this can cause unexpected downtime.

maintenance_policy {
  recurring_window {
    start_time = "2026-01-01T04:00:00Z"
    end_time   = "2026-01-01T08:00:00Z"
    recurrence = "FREQ=WEEKLY;BYDAY=SA"  # Saturday 4-8 AM UTC
  }
}

2. No master_global_access_config: With enable_private_endpoint = true, the master is only reachable from the same region. If the WireGuard VPN server or any CI/CD runner is in a different region, kubectl won't connect.

private_cluster_config {
  ...
  master_global_access_config {
    enabled = true
  }
}

This is a quick add that avoids a hard-to-debug connectivity issue later.

@harry-rhesis
Contributor

P2 — CIDR values hardcoded in multiple places

Files: infrastructure/main.tf, envs/dev/main.tf, envs/stg/main.tf, envs/prd/main.tf

The same CIDR values are specified 3-4 times per environment. For example, 10.2.4.0/28 (dev master CIDR) appears in:

  1. Root main.tf → network module
  2. Root main.tf → GKE module
  3. envs/dev/main.tf → network module
  4. envs/dev/main.tf → GKE module

If any one of these drifts, the cluster or firewall rules will be silently misconfigured.

Suggested fix: Use locals blocks to define CIDRs once, or better yet, have the GKE module read values from the network module's outputs. For example, the network module could output master_cidr, and the GKE module could consume it — single source of truth.
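A minimal sketch of the locals approach for dev (module names are illustrative; only the master CIDR value is taken from above):

locals {
  dev_master_cidr = "10.2.4.0/28"  # defined once, consumed by both modules
}

module "network_dev" {
  ...
  master_cidr = local.dev_master_cidr
}

module "gke_dev" {
  ...
  master_cidr = local.dev_master_cidr
}

The output-based variant goes one step further: the network module outputs master_cidr and the GKE module reads module.network_dev.master_cidr, so the value lives in exactly one place.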

@harry-rhesis
Contributor

P3 — Minor cleanups

A few small items that can be addressed in this PR or as follow-ups:

1. Vestigial master_cidr in network module. The variable is still declared in modules/network/gcp/variables.tf and passed by all callers, but the master subnet resource was removed from subnets.tf. It's dead code now — removing it avoids confusion about whether a master subnet still exists.

2. Variable description stale. create_gke_subnets description says "nodes, ilb, master, pods, services" but master subnet is no longer created. Minor wording fix.

3. Azure module pins azurerm ~> 3.0. For a forward-looking placeholder in 2026, ~> 4.0 would be more current. Small thing, but avoids starting a new module on an older major version.

4. Network policy / Dataplane V2. Not a blocker, but for a security-focused private cluster setup, consider enabling GKE Dataplane V2 as a follow-up (in Terraform this is datapath_provider = "ADVANCED_DATAPATH" on the cluster resource; it includes built-in network policy support). This would allow pod-to-pod traffic control inside the cluster.

5. Output formatting. Missing blank lines between some output blocks in envs/*/outputs.tf — cosmetic only.

@harry-rhesis
Contributor

harry-rhesis commented Feb 11, 2026

P4 — Azure AKS module placeholder: missing network module and Private DNS consideration

Files: modules/kubernetes/azure/main.tf, modules/kubernetes/azure/outputs.tf, modules/kubernetes/azure/variables.tf

The Azure placeholder is a nice forward-looking addition, but a few things to flag for when this gets built out:

1. No corresponding modules/network/azure/ module. The GCP side has a full network module (vpc.tf, subnets.tf, firewall.tf, nat.tf, peering.tf) that the Kubernetes module depends on. Azure will need the same: VNet, subnets (nodes, pods if using CNI Overlay, ILB, Private Endpoints), NSGs, NAT Gateway, and VNet peering to the WireGuard VNet. Consider stubbing out modules/network/azure/ with TODOs to mirror the GCP structure, so the dependency is visible.

2. The TODO list is missing Private DNS — the hardest part of AKS private clusters. AKS private clusters require a Private DNS Zone (privatelink.<region>.azmk8s.io) linked to the WireGuard VNet so that kubectl can resolve the cluster's private FQDN. Without this, you hit the same problem as the GCP export_custom_routes gap — VPN connectivity works but name resolution fails. The cluster identity also needs Private DNS Zone Contributor permissions on that zone. This is worth adding to the TODO comments so it's not missed later.

3. Provider version should be ~> 4.0. AzureRM 4.x has been GA and 3.x→4.x introduced breaking changes (mandatory subscription_id in provider, removed deprecated resources). Starting a new module on 3.x means an immediate migration when it gets implemented.

Suggested update to the TODO comments:

# TODO: Azure AKS private cluster module (future implementation)
# Prerequisites:
# - modules/network/azure/ (VNet, subnets, NSGs, NAT Gateway, VNet peering)
# - Private DNS Zone (privatelink.<region>.azmk8s.io) linked to WireGuard VNet
# - User-Assigned Managed Identity with Private DNS Zone Contributor role
# Design: private API server, workload identity, system+user node pools,
#         integration with existing VNet, Private DNS for kubectl resolution.

Not a blocker for this PR — just enriching the breadcrumbs for whoever picks up the Azure track.
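For whoever does pick it up, a rough sketch of the Private DNS wiring from item 2 (variable names like var.location, var.resource_group_name, and var.wireguard_vnet_id are placeholders, not existing module inputs):

resource "azurerm_private_dns_zone" "aks" {
  name                = "privatelink.${var.location}.azmk8s.io"
  resource_group_name = var.resource_group_name
}

resource "azurerm_private_dns_zone_virtual_network_link" "wireguard" {
  name                  = "aks-wireguard-link"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.aks.name
  virtual_network_id    = var.wireguard_vnet_id
}

# The cluster's user-assigned identity needs the Private DNS Zone Contributor role on
# this zone, and azurerm_kubernetes_cluster points at it via private_dns_zone_id.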

harry-rhesis changed the title from "Feat/Kubernete Private Cluster" to "Feat/Kubernetes Private Cluster" on Feb 11, 2026
@asadaaron
Collaborator Author

asadaaron commented Feb 11, 2026

Replying to "P0 — Standalone env deploys create unreachable GKE clusters" above:

[Screenshot 2026-02-11 at 18 07 47]

This ticket is about creating an isolated cluster for each of dev, stg and prd (the lower part of the image). I am working on the WireGuard peering so that each cluster can be reached through WireGuard. Once the WireGuard VPN is in place, we will be able to communicate with all of the clusters under the defined roles via the VPN. Most of the fixes will come in the next PR.

asadaaron marked this pull request as draft on February 13, 2026, 14:23
