
Feat/Kubernetes Private Cluster #1300

Draft

asadaaron wants to merge 3 commits into main from feat/kubernete-private-cluster

Conversation

@asadaaron
Collaborator

This PR introduces changes from the feat/kubernete-private-cluster branch.

📝 Summary

📁 Files Changed (27 files)

.gitignore
terraform/infrastructure/envs/dev/main.tf
terraform/infrastructure/envs/dev/outputs.tf
terraform/infrastructure/envs/dev/variables.tf
terraform/infrastructure/envs/prd/main.tf
terraform/infrastructure/envs/prd/outputs.tf
terraform/infrastructure/envs/prd/variables.tf
terraform/infrastructure/envs/stg/main.tf
terraform/infrastructure/envs/stg/outputs.tf
terraform/infrastructure/envs/stg/variables.tf
terraform/infrastructure/main.tf
terraform/infrastructure/modules/kubernetes/azure/main.tf
terraform/infrastructure/modules/kubernetes/azure/outputs.tf
terraform/infrastructure/modules/kubernetes/azure/variables.tf
terraform/infrastructure/modules/kubernetes/gcp/cluster.tf
terraform/infrastructure/modules/kubernetes/gcp/firewall.tf
terraform/infrastructure/modules/kubernetes/gcp/main.tf
terraform/infrastructure/modules/kubernetes/gcp/node_pool.tf
terraform/infrastructure/modules/kubernetes/gcp/outputs.tf
terraform/infrastructure/modules/kubernetes/gcp/service_account.tf
terraform/infrastructure/modules/kubernetes/gcp/variables.tf
terraform/infrastructure/modules/network/gcp/firewall.tf
terraform/infrastructure/modules/network/gcp/main.tf
terraform/infrastructure/modules/network/gcp/nat.tf
terraform/infrastructure/modules/network/gcp/outputs.tf
terraform/infrastructure/modules/network/gcp/subnets.tf
terraform/infrastructure/outputs.tf

📋 Commit Details

bf3eeebf - fix: include tfplan into the gitignore. (Md Asaduzzaman Miah, 2026-02-10 13:26)
7d808e71 - fix: All cluster resources were brought up, but: only 0 nodes out of 3 have registered. (Md Asaduzzaman Miah, 2026-02-10 10:56)
338d9b44 - kubernetes private cluster (Md Asaduzzaman Miah, 2026-02-09 16:43)

✅ Checklist

  • Code follows the project's style guidelines
  • Self-review of code has been performed
  • Code is commented, particularly in hard-to-understand areas
  • Corresponding changes to documentation have been made
  • Tests have been added/updated for new functionality
  • All tests pass locally

🧪 Testing

📸 Screenshots (if applicable)

🔗 Related Issues

asadaaron self-assigned this Feb 10, 2026
@harry-rhesis
Contributor

Overall Review

Great work on this PR, Asad! Setting up private GKE clusters with proper network isolation is non-trivial infrastructure work, and the foundations here are solid. The module structure is clean and well-organized (cluster.tf, node_pool.tf, firewall.tf, service_account.tf -- nice separation of concerns), the IP addressing scheme is well-planned with no CIDR overlaps, and I appreciate the thoughtful decisions like Workload Identity, shielded nodes, appropriate machine sizing per environment, and deletion_protection = true on prd.

The Cloud NAT addition and the correct removal of the master subnet (with the explanatory comment) show a good understanding of how GKE private clusters interact with the network layer.

I've left a few prioritized comments below -- some are things that could bite us in production if not addressed now, others are hardening suggestions and minor cleanups. Nothing here takes away from the quality of the overall design.

@harry-rhesis
Contributor

P0 — Standalone env deploys create unreachable GKE clusters

Files: envs/dev/main.tf, envs/stg/main.tf, envs/prd/main.tf

The env configs say "Standalone ... network (no peering). For full deploy with peerings run from infrastructure/" — but they now include the GKE module with enable_private_endpoint = true and only the WireGuard CIDR as an authorized network.

In a standalone deploy (no peering to the WireGuard VPC):

  • There is no route from WireGuard to this VPC
  • The only authorized network is the WireGuard CIDR
  • There is no public API endpoint (enable_private_endpoint = true)

This creates a cluster that nobody can reach — not via VPN (no peering), not via public internet (private only). The cluster becomes unmanageable immediately after creation.

Suggested fix: Either:

  1. Don't include the GKE module in standalone env configs (keep GKE only in root main.tf where peerings exist), or
  2. Add the node subnet CIDR as an additional authorized network so at least in-VPC access works, or
  3. Add a variable like enable_public_endpoint that defaults to true in standalone mode and false in root mode
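For option 3, a minimal sketch inside the GKE module (var.master_cidr already exists per the firewall rules; the new variable name follows the suggestion above, everything else is illustrative):

variable "enable_public_endpoint" {
  description = "Expose a public API endpoint (still gated by authorized networks). Intended for standalone env deploys."
  type        = bool
  default     = false
}

# cluster.tf
private_cluster_config {
  enable_private_nodes    = true
  enable_private_endpoint = !var.enable_public_endpoint
  master_ipv4_cidr_block  = var.master_cidr
}

The root main.tf would keep the default (fully private), while the standalone envs/*/main.tf would set enable_public_endpoint = true.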

@harry-rhesis
Contributor

P0 — VPC peering missing export_custom_routes (kubectl via VPN won't work)

File: terraform/infrastructure/main.tf (peering resources)

When GKE creates a private cluster, it peers a Google-managed VPC (containing the master) into the env VPC and imports the route automatically. However, for WireGuard clients to reach the master, that route needs to propagate through the WireGuard ↔ env peering.

The current peerings are bare:

resource "google_compute_network_peering" "dev_to_wireguard" {
  name         = "peering-dev-to-wireguard"
  network      = module.dev.vpc_self_link
  peer_network = module.wireguard.vpc_self_link
}

Without custom route exchange, the WireGuard VPC has no route to the GKE master private IP. kubectl over VPN will fail with connection timeouts.

Suggested fix: Add to each env→wireguard peering:

export_custom_routes = true

And to each wireguard→env peering:

import_custom_routes = true
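Put together, the dev pair would look roughly like this (the reverse-direction resource name is assumed; stg and prd follow the same pattern):

resource "google_compute_network_peering" "dev_to_wireguard" {
  name                 = "peering-dev-to-wireguard"
  network              = module.dev.vpc_self_link
  peer_network         = module.wireguard.vpc_self_link
  export_custom_routes = true   # export custom routes toward the WireGuard VPC
}

resource "google_compute_network_peering" "wireguard_to_dev" {
  name                 = "peering-wireguard-to-dev"
  network              = module.wireguard.vpc_self_link
  peer_network         = module.dev.vpc_self_link
  import_custom_routes = true   # accept those routes on the WireGuard side
}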

@harry-rhesis
Contributor

P1 — Firewall rules gke_nodes_to_master and gke_wireguard_to_master are no-ops

File: terraform/infrastructure/modules/kubernetes/gcp/firewall.tf

These two rules are INGRESS rules (the default direction) with destination_ranges pointing at the master CIDR:

resource "google_compute_firewall" "gke_nodes_to_master" {
  ...
  source_ranges      = [var.node_cidr, var.pod_cidr]
  destination_ranges = [var.master_cidr]
}

The GKE master lives in a Google-managed peered VPC, not in your VPC. An ingress rule in your VPC with destination_ranges = master_cidr won't match any real traffic, because no packets arriving at VMs in your VPC will have a destination IP in the master range. These rules create a false sense of coverage but do nothing.

The actual control-plane access works because:

  • master_authorized_networks_config handles API authorization (correctly set)
  • GKE automatically manages firewall rules on the peered connection
  • Egress from nodes to master just works (since deny-all egress was removed)

Suggested fix: Remove these two rules entirely. They can't control traffic to a resource in another VPC. If you want to document the intent, a comment explaining that GKE manages control-plane connectivity would be clearer.
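If a breadcrumb is wanted in firewall.tf after removing them, something along these lines (wording is only a suggestion):

# NOTE: No rules for the GKE control plane are defined here on purpose. The master
# lives in a Google-managed peered VPC; API access is restricted via
# master_authorized_networks_config, and GKE manages the firewall rules required
# for master <-> node traffic automatically. An ingress rule in this VPC with the
# master CIDR as destination would never match real traffic.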

@harry-rhesis
Contributor

P1 — Deny-all egress removed without replacement + gke_internal too broad

Files: modules/network/gcp/firewall.tf, modules/kubernetes/gcp/firewall.tf

Two related firewall concerns:

1. Egress is now fully open. The deny-all egress rule was removed (with a helpful comment explaining why), and the comment says "Use specific allow/deny rules in the GKE module for fine-grained control." However, the GKE module doesn't define any egress rules. All egress is now GCP-default-allowed (0.0.0.0/0). This is a meaningful security posture change from the original baseline.

Suggested fix: Add specific egress allow rules in the GKE module for what nodes actually need (Google APIs via restricted VIP 199.36.153.4/30, control plane CIDR, Cloud NAT gateway) and keep a deny-all egress baseline at a lower priority. Or, if the intent is to rely on Cloud NAT + Private Google Access for now, document that explicitly.

2. gke_internal has no target_tags. This rule allows all TCP/UDP/ICMP from cluster CIDRs to every instance in the VPC, not just GKE nodes:

source_ranges = [var.node_cidr, var.pod_cidr, var.service_cidr]
# no target_tags, so this applies to every instance in the VPC

Suggested fix: Add target_tags = ["gke-${var.environment}"] to scope this to GKE node instances only.
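To make the egress suggestion in item 1 concrete, a partial sketch for the GKE module (var.network is an assumed variable name; further allow rules from the list above, e.g. the control-plane CIDR, would be needed before the deny baseline is safe to apply):

resource "google_compute_firewall" "gke_allow_google_apis_egress" {
  name      = "gke-${var.environment}-allow-google-apis-egress"
  network   = var.network
  direction = "EGRESS"
  priority  = 900

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  destination_ranges = ["199.36.153.4/30"]  # restricted.googleapis.com VIP
  target_tags        = ["gke-${var.environment}"]
}

resource "google_compute_firewall" "gke_deny_all_egress" {
  name      = "gke-${var.environment}-deny-all-egress"
  network   = var.network
  direction = "EGRESS"
  priority  = 65000   # evaluated after the specific allows above

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["gke-${var.environment}"]
}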

@harry-rhesis
Contributor

P2 — Local state backend for production GKE

Files: infrastructure/main.tf, all envs/*/main.tf

All configs use backend "local" {}. Now that we're managing GKE clusters (especially prd with deletion_protection = true), local state brings real risk:

  • No state locking — concurrent terraform apply runs can corrupt state
  • State lives only on the machine that ran apply — if a laptop is lost, so is the state
  • No shared visibility — teammates can't see or plan against current infra
  • Orphaned resources if state is lost (GKE clusters, service accounts, IAM bindings)

Suggestion: This doesn't need to block the PR, but should be a fast follow-up before any production apply. A GCS bucket with versioning + state locking is the standard approach:

backend "gcs" {
  bucket = "rhesis-terraform-state"
  prefix = "infrastructure/dev"  # per-env prefix
}
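For reference, the backend block sits inside the top-level terraform block, and the existing local state can be moved over with terraform init -migrate-state once the bucket exists:

terraform {
  backend "gcs" {
    bucket = "rhesis-terraform-state"
    prefix = "infrastructure/dev"  # per-env prefix, e.g. infrastructure/stg, infrastructure/prd
  }
}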

@harry-rhesis
Contributor

P2 — No maintenance window or master_global_access_config

File: modules/kubernetes/gcp/cluster.tf

Two GKE cluster settings worth adding:

1. No maintenance_policy: Without this, GKE will auto-upgrade nodes and masters at any time. For staging and especially production, this can cause unexpected downtime.

maintenance_policy {
  recurring_window {
    start_time = "2026-01-01T04:00:00Z"
    end_time   = "2026-01-01T08:00:00Z"
    recurrence = "FREQ=WEEKLY;BYDAY=SA"  # Saturday 4-8 AM UTC
  }
}

2. No master_global_access_config: With enable_private_endpoint = true, the master is only reachable from the same region. If the WireGuard VPN server or any CI/CD runner is in a different region, kubectl won't connect.

private_cluster_config {
  ...
  master_global_access_config {
    enabled = true
  }
}

This is a quick add that avoids a hard-to-debug connectivity issue later.

@harry-rhesis
Contributor

P2 — CIDR values hardcoded in multiple places

Files: infrastructure/main.tf, envs/dev/main.tf, envs/stg/main.tf, envs/prd/main.tf

The same CIDR values are specified 3-4 times per environment. For example, 10.2.4.0/28 (dev master CIDR) appears in:

  1. Root main.tf → network module
  2. Root main.tf → GKE module
  3. envs/dev/main.tf → network module
  4. envs/dev/main.tf → GKE module

If any one of these drifts, the cluster or firewall rules will be silently misconfigured.

Suggested fix: Use locals blocks to define CIDRs once, or better yet, have the GKE module read values from the network module's outputs. For example, the network module could output master_cidr, and the GKE module could consume it — single source of truth.
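A minimal sketch of the locals approach for dev (module names are illustrative; only the master CIDR value is taken from above):

locals {
  dev_master_cidr = "10.2.4.0/28"  # defined once, consumed by both modules
}

module "network_dev" {
  ...
  master_cidr = local.dev_master_cidr
}

module "gke_dev" {
  ...
  master_cidr = local.dev_master_cidr
}

The output-based variant goes one step further: the network module outputs master_cidr and the GKE module reads module.network_dev.master_cidr, so the value lives in exactly one place.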

@harry-rhesis
Contributor

P3 — Minor cleanups

A few small items that can be addressed in this PR or as follow-ups:

1. Vestigial master_cidr in network module. The variable is still declared in modules/network/gcp/variables.tf and passed by all callers, but the master subnet resource was removed from subnets.tf. It's dead code now — removing it avoids confusion about whether a master subnet still exists.

2. Variable description stale. create_gke_subnets description says "nodes, ilb, master, pods, services" but master subnet is no longer created. Minor wording fix.

3. Azure module pins azurerm ~> 3.0. For a forward-looking placeholder in 2026, ~> 4.0 would be more current. Small thing, but avoids starting a new module on an older major version.

4. Network policy / Dataplane V2. Not a blocker, but for a security-focused private cluster setup, consider enabling GKE Dataplane V2 as a follow-up (in Terraform this is datapath_provider = "ADVANCED_DATAPATH" on the cluster resource; it includes built-in network policy support). This would allow pod-to-pod traffic control inside the cluster.

5. Output formatting. Missing blank lines between some output blocks in envs/*/outputs.tf — cosmetic only.

@harry-rhesis
Contributor

harry-rhesis commented Feb 11, 2026

P4 — Azure AKS module placeholder: missing network module and Private DNS consideration

Files: modules/kubernetes/azure/main.tf, modules/kubernetes/azure/outputs.tf, modules/kubernetes/azure/variables.tf

The Azure placeholder is a nice forward-looking addition, but a few things to flag for when this gets built out:

1. No corresponding modules/network/azure/ module. The GCP side has a full network module (vpc.tf, subnets.tf, firewall.tf, nat.tf, peering.tf) that the Kubernetes module depends on. Azure will need the same: VNet, subnets (nodes, pods if using CNI Overlay, ILB, Private Endpoints), NSGs, NAT Gateway, and VNet peering to the WireGuard VNet. Consider stubbing out modules/network/azure/ with TODOs to mirror the GCP structure, so the dependency is visible.

2. The TODO list is missing Private DNS — the hardest part of AKS private clusters. AKS private clusters require a Private DNS Zone (privatelink.<region>.azmk8s.io) linked to the WireGuard VNet so that kubectl can resolve the cluster's private FQDN. Without this, you hit the same problem as the GCP export_custom_routes gap — VPN connectivity works but name resolution fails. The cluster identity also needs Private DNS Zone Contributor permissions on that zone. This is worth adding to the TODO comments so it's not missed later.

3. Provider version should be ~> 4.0. AzureRM 4.x has been GA and 3.x→4.x introduced breaking changes (mandatory subscription_id in provider, removed deprecated resources). Starting a new module on 3.x means an immediate migration when it gets implemented.

Suggested update to the TODO comments:

# TODO: Azure AKS private cluster module (future implementation)
# Prerequisites:
# - modules/network/azure/ (VNet, subnets, NSGs, NAT Gateway, VNet peering)
# - Private DNS Zone (privatelink.<region>.azmk8s.io) linked to WireGuard VNet
# - User-Assigned Managed Identity with Private DNS Zone Contributor role
# Design: private API server, workload identity, system+user node pools,
#         integration with existing VNet, Private DNS for kubectl resolution.

Not a blocker for this PR — just enriching the breadcrumbs for whoever picks up the Azure track.
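For whoever does pick it up, a rough sketch of the Private DNS wiring from item 2 (variable names like var.location, var.resource_group_name, and var.wireguard_vnet_id are placeholders, not existing module inputs):

resource "azurerm_private_dns_zone" "aks" {
  name                = "privatelink.${var.location}.azmk8s.io"
  resource_group_name = var.resource_group_name
}

resource "azurerm_private_dns_zone_virtual_network_link" "wireguard" {
  name                  = "aks-wireguard-link"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.aks.name
  virtual_network_id    = var.wireguard_vnet_id
}

# The cluster's user-assigned identity needs the Private DNS Zone Contributor role on
# this zone, and azurerm_kubernetes_cluster points at it via private_dns_zone_id.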

harry-rhesis changed the title from "Feat/Kubernete Private Cluster" to "Feat/Kubernetes Private Cluster" on Feb 11, 2026
@asadaaron
Collaborator Author

asadaaron commented Feb 11, 2026

Replying to "P0 — Standalone env deploys create unreachable GKE clusters" above:

[Screenshot 2026-02-11 at 18 07 47]

This ticket is about creating an isolated cluster for each of dev, stg and prd (the lower part of the image). I am working on the WireGuard peering so that each cluster can be reached through WireGuard. Once the WireGuard VPN is in place, we will be able to communicate with all of the clusters under the defined roles via the VPN. Most of the fixes will come in the next PR.

asadaaron marked this pull request as draft on February 13, 2026, 14:23
